Abstract

This paper considers the derivative of the entropy rate of a hidden Markov process with respect 
to the observation probabilities. The main result is a compact formula for the derivative that can be 
evaluated easily using Monte Carlo methods. It is applied to the problem of computing the capacity 
of a finite-state channel (FSC) and, in the high-noise regime, the formula has a simple closed-form 
expression that enables a series expansion of the capacity of an FSC. This expansion is evaluated for a
binary-symmetric channel under a (0,1) run-length limited constraint and an intersymbol-interference 
channel with Gaussian noise. 
1 Introduction

1.1 The Hidden Markov Process 

A hidden Markov process (HMP) is a discrete-time finite-state Markov chain (FSMC) observed through 
a memoryless channel. The HMP has become ubiquitous in statistics, computer science, and electrical 
engineering because it approximates many processes well using a dependency structure that leads to 
many efficient algorithms. While the roots of the HMP lie in the "grouped Markov chains" of Harris
[2"T] and the "functions of a finite-state Markov chain" of Blackwell [8], the HMP first appears (in full
generality) as the output process of a finite-state channel (FSC) [9]. The statistical inference algorithm
of Baum and Petrie [5], however, cemented the HMP's place in history and is responsible for great
advances in fields such as speech recognition and biological sequence analysis [23], [25]. An exceptional
survey of HMPs, by Ephraim and Merhav, summarizes what is known in this area [13].

Definition 1.1. Let $\mathcal{Q}$ be the state set of an irreducible aperiodic FSMC $\{Q_t\}_{t\in\mathbb{Z}}$ with state transition
matrix $P$ and define

$$p_{ij} \triangleq [P]_{i,j} = \Pr(Q_{t+1} = j \mid Q_t = i)$$

for $i,j \in \mathcal{Q}$. Let $\mathcal{Y}$ be a finite set of possible observations and $\{Y_t\}_{t\in\mathbb{Z}}$ be the stochastic process where
$Y_t \in \mathcal{Y}$ is generated by the transition from $Q_t$ to $Q_{t+1}$. The distribution of the observation conditioned
on the FSMC transition^1 is given by

$$h_{ij}(y) \triangleq \begin{cases} \Pr(Y_t = y \mid Q_t = i, Q_{t+1} = j) & \text{if } (i,j) \in \mathcal{V} \\ 0 & \text{otherwise} \end{cases}$$

for $i,j \in \mathcal{Q}$, where $\mathcal{V} = \{(i,j) \in \mathcal{Q}\times\mathcal{Q} \mid p_{ij} > 0\}$ is the set of valid transitions. The ergodic process
$\{Y_t\}_{t\in\mathbb{Z}}$ is called a hidden Markov process. With proper initialization, the process is also stationary.

Although the notation of this paper assumes that $\mathcal{Y}$ is a finite set, many results remain correct when
$\mathcal{Y} = \mathbb{R}$ if $h_{ij}(y)$ is assumed to be a continuous p.d.f. and sums over $\mathcal{Y}$ are converted to integrals over $\mathbb{R}$.



* Henry Pfister is with the Electrical and Computer Engineering Department of Texas A&M University (e-mail:
hpfister@tamu.edu). His research was supported in part by the National Science Foundation under Grant No. 074740.

1 In general, HMPs are defined by noisy observations of the FSMC states (rather than the transitions). This paper 
uses the "transition observation" model instead because of its natural connection with finite-state channels. Moreover, any 
random process that can be represented by the "transition observation" HMP model with M states can also be represented 
by the "state observation" model with $M^2$ states.
1.2 The Entropy Rate

The entropy rate of a stationary stochastic process $\{Y_t\}_{t\in\mathbb{Z}}$ is defined to be

$$H(\mathcal{Y}) \triangleq \lim_{n\to\infty} \frac{1}{n} H(Y_1, \ldots, Y_n),$$

where $H(Y_1) = -E[\ln \Pr(Y_1)]$ is the entropy of the random variable (r.v.) $Y_1$ and the limit exists and
is finite if $H(Y_1) < \infty$ [11]. Computing the exact entropy rate of an HMP in closed form appears to be
difficult, however. In [8], Blackwell states

"In this paper we study the entropy of the {y n } [hidden Markov] process; our result 
suggests that this entropy is intrinsically a complicated function of [the parameters of the 
hidden Markov process] M and 

On the other hand, the Shannon-McMillan-Breiman Theorem shows that the empirical entropy rate
$-\frac{1}{n}\ln\Pr(Y_1^n)$ converges almost surely to the entropy rate $H(\mathcal{Y})$ (in nats) as $n \to \infty$. Therefore,
simulation-based (i.e., Monte Carlo) approaches work well in many cases [5U1 [T71 [Tl 1581 I5T1 [51 [2].

Other early work related to the entropy rate of HMPs can be found in [7J [SHI [Ml [55] . Recently, 
interest in HMPs has surged and there have been a large number of papers discussing the entropy rate 
of HMPs. These range from bounds [37], [31], [32] to establishing the analyticity of the entropy rate [18]
to computing series expansions of the entropy rate |44[ W2\ [20] . 

1.3 The Finite-State Channel 

The work in this paper is largely motivated by the analysis of a class of time-varying channels known
as FSCs. An FSC is a discrete-time channel where the distribution of the channel output depends on 
both the channel input and the underlying channel state [16] . This allows the channel output to depend 
implicitly on previous inputs and outputs via the channel state. In practice, there are three types of 
channel variation which FSCs are typically used to model. A flat fading channel is a time-varying 
channel whose state is independent of the channel inputs. An intersymbol-interference (ISI) channel is 
a time-varying channel whose state is a deterministic function of the previous channel inputs. Channels 
which exhibit both fading and ISI can also be modeled, and their state is a stochastic function of the 
previous channel inputs. An indecomposable FSC is, roughly speaking, an FSC where the effect of the
initial state decays with time. The output process of an indecomposable FSC with an ergodic Markov 
input is an HMP. 

Consider an indecomposable FSC with state set $\mathcal{S}$, finite input alphabet $\mathcal{X}$, and output alphabet $\mathcal{Y}$.
The channel is defined by its input-output state-transition probability $W(y, s' \mid x, s)$, which is defined for
all $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $s, s' \in \mathcal{S}$. Using this notation, $W(y, s' \mid x, s)$ is the conditional probability that
the channel output is $y$ and the new channel state is $s'$ given that the channel input was $x$ and the
initial state was $s$. The n-step transition probability for a sequence of $n$ channel uses (with input $x_1^n$
and output $y_1^n$) is given by

$$\Pr(Y_1^n = y_1^n \mid X_1^n = x_1^n) = \sum_{s_1^{n+1} \in \mathcal{S}^{n+1}} \Pr(S_1 = s_1) \prod_{t=1}^{n} W(y_t, s_{t+1} \mid x_t, s_t).$$

When $\mathcal{Y} = \mathbb{R}$, we will also use $W(y, s' \mid x, s)$ to represent a conditional probability density function for
the channel outputs.

The achievable information rate of an FSC with Markov inputs is intimately related to the entropy 
rate of an HMP [TJ [381 1211 El H21 [22] ■ Computing this entropy rate exactly is usually quite difficult, and 
often the main obstacle in the computation of achievable rates. 

1.4 Main Results 

The main result of this paper, given in Theorem 3.2, is a compact formula for the derivative, with
respect to the observation probability $h_{ij}(y)$, of the entropy rate of a general HMP. A Monte Carlo
estimator for this derivative follows easily because the formula is an expectation over distributions that
are relatively easy to sample. The formula is also amenable to analysis in some asymptotic regimes. In
particular, Theorem 3.6 derives a simple formula for the first two non-trivial terms in the expansion of
the entropy rate in the high-noise regime.

In a later section, this derivative formula also allows one to consider the derivative of achievable information
rates for FSCs. For example, a closed-form expression for the capacity of a BSC under a (0,1) RLL
constraint is derived in the high-noise limit. Section 2 provides the mathematical background necessary
for the later sections.

2 Mathematical Background 

2.1 Notation 

Calligraphic letters are used to denote sets (e.g., $\mathcal{Q}, \mathcal{Y}, \mathcal{V}$) and $1_{\mathcal{Y}}(\cdot)$ is the indicator function of the set
$\mathcal{Y}$. Capital letters are used to denote random variables (e.g., $Q_t, Y_t$) and matrices (e.g., $M, P$). Lowercase
letters are used to represent realizations of random variables (e.g., $q_t, y_t$), column vectors (e.g.,
$\pi, \alpha, \beta, u, v$), and indices (e.g., $i, j, k, l$). The i-th element of the vector $\pi$ is denoted $\pi(i)$.

The following sets will also be used: $\mathbb{R}_+ = \{a \in \mathbb{R} \mid a > 0\}$, $\mathcal{A} = \mathbb{R}^{|\mathcal{Q}|}$, $\mathcal{A}_\delta = \{u \in \mathcal{A} \mid u(q) > \delta, q \in \mathcal{Q}\}$,
$\mathcal{P} = \{u \in \mathcal{A} \mid \sum_q u(q) = 1\}$, and $\mathcal{P}_\delta = \mathcal{A}_\delta \cap \mathcal{P}$. We note that the symbols $\pi, \alpha_t \in \mathcal{P}_0$ are used
interchangeably to denote distributions over $\mathcal{Q}$ and $|\mathcal{Q}|$-dimensional column vectors (e.g., $\pi^T P = \pi^T$).
The standard p-norm of the vector $u$ is denoted by $\|u\|_p \triangleq \left(\sum_i |u(i)|^p\right)^{1/p}$ and the induced matrix norm
is $\|M\|_p \triangleq \sup_{\|u\|_p = 1} \|Mu\|_p$.

2.2 The Forward-Backward Algorithm 

One of the primary reasons for the popularity of HMPs is that the forward and backward state estimation
problems have a simple recursive structure. Let us assume that the Markov chain $\{Q_t\}_{t\in\mathbb{Z}}$ is stationary
and that $\pi \in \mathcal{P}_0$ is the unique stationary distribution that satisfies $\pi^T P = \pi^T$. For a length-n block, let
the forward state probability $\alpha_t \in \mathcal{P}$ and the backward state probability $\beta_t \in \mathcal{A}$ be defined by

$$\alpha_t(i) \triangleq \Pr(Q_t = i \mid Y_1^{t-1} = y_1^{t-1})$$
$$\beta_t(j) \triangleq \frac{1}{\pi(j)} \Pr(Q_t = j \mid Y_t^n = y_t^n)$$

for $i, j \in \mathcal{Q}$. These definitions lead naturally to the recursions

$$\alpha_{t+1}(j) = \frac{1}{\psi_{t+1}} \sum_{i\in\mathcal{Q}} \alpha_t(i)\, p_{ij}\, h_{ij}(y_t)$$
$$\beta_{t-1}(i) = \frac{1}{\phi_{t-1}} \sum_{j\in\mathcal{Q}} \beta_t(j)\, p_{ij}\, h_{ij}(y_{t-1})$$

for $i,j \in \mathcal{Q}$, where $\psi_{t+1}$ is chosen so that $\sum_{j\in\mathcal{Q}} \alpha_{t+1}(j) = 1$ and $\phi_{t-1}$ is chosen^2 so that $\sum_{j\in\mathcal{Q}} \pi(j)\beta_{t-1}(j) = 1$.
It is worth noting that $\psi_{t+1} = \Pr(Y_t = y_t \mid Y_1^{t-1} = y_1^{t-1})$ and therefore we find that

$$-\frac{1}{n}\sum_{t=1}^{n} \ln \psi_{t+1} = -\frac{1}{n}\ln\Pr(Y_1^n = y_1^n) \xrightarrow{\text{a.s.}} H(\mathcal{Y}) \text{ nats}.$$

This simple connection between the forward recursion and the entropy rate implies a simple Monte Carlo
approach to estimating the achievable information rates of FSCs [1], [38], [37], [3], [2].

2 We believe this normalization for $\beta_{t-1}(q)$ is new and it appears to be the natural choice for the problem considered in
this paper (and perhaps in general).
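The Monte Carlo approach of Section 2.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function names, the array layout `h[i, j, y]` for $h_{ij}(y)$, and the two-state example parameters are all assumptions made here for illustration.

```python
import numpy as np

def stationary(P):
    """Stationary distribution: left eigenvector of P for eigenvalue 1."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def sample_observations(P, h, n, rng):
    """Draw y_1..y_n from the transition-observation model; h[i, j, y] = h_ij(y)."""
    q = rng.choice(P.shape[0], p=stationary(P))
    ys = np.empty(n, dtype=int)
    for t in range(n):
        q_next = rng.choice(P.shape[0], p=P[q])
        ys[t] = rng.choice(h.shape[2], p=h[q, q_next])
        q = q_next
    return ys

def entropy_rate_estimate(P, h, ys):
    """Return -(1/n) * sum_t ln(psi_{t+1}) in nats via the forward recursion."""
    alpha = stationary(P)                    # alpha_1 = pi
    total = 0.0
    for y in ys:
        unnorm = alpha @ (P * h[:, :, y])    # sum_i alpha_t(i) p_ij h_ij(y_t)
        psi = unnorm.sum()                   # psi_{t+1} = Pr(Y_t = y_t | past)
        total -= np.log(psi)
        alpha = unnorm / psi                 # alpha_{t+1}
    return total / len(ys)

# Hypothetical two-state example with binary observations.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
h = np.empty((2, 2, 2))
h[:, :, 0] = [[0.95, 0.5], [0.5, 0.1]]
h[:, :, 1] = 1.0 - h[:, :, 0]
rng = np.random.default_rng(0)
estimate = entropy_rate_estimate(P, h, sample_observations(P, h, 20000, rng))
```

Normalizing $\alpha_t$ at every step keeps the recursion numerically stable for long sequences, which is why the estimate accumulates $\ln\psi_{t+1}$ instead of forming the full product.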
2.3 The Matrix Perspective

2.3.1 The Forward-Backward Algorithm 

In this section, we review a natural connection between the product of random matrices and the forward- 
backward recursions. This connection is interesting in its own right, but will also be very helpful in 
understanding the results of later sections. 

Definition 2.1. For any $y \in \mathcal{Y}$, the transition-observation probability matrix, $M(y)$, is a $|\mathcal{Q}| \times |\mathcal{Q}|$ matrix
defined by

$$[M(y)]_{ij} \triangleq \Pr(Y_t = y, Q_{t+1} = j \mid Q_t = i) = p_{ij}\, h_{ij}(y). \qquad (2.1)$$

These matrices behave similarly to transition probability matrices because their sequential products
compute the n-step transition observation probabilities of the form

$$[M(y_t) M(y_{t+1}) \cdots M(y_{t+k})]_{ij} = \Pr(Y_t^{t+k} = y_t^{t+k}, Q_{t+k+1} = j \mid Q_t = i).$$

This means that we can write $\Pr(Y_1^n = y_1^n)$ as the matrix product^3

$$\Pr(Y_1^n = y_1^n) = \pi^T \left(\prod_{t=1}^{n} M(y_t)\right) \mathbf{1}, \qquad (2.2)$$

where $\mathbf{1}$ is a $|\mathcal{Q}|$-dimensional column vector of ones. When $\mathcal{Y} = \mathbb{R}$, the above expressions are understood
to be probability density functions with respect to the observations and the joint probability becomes
the joint density.

Likewise, the forward/backward recursions can be written in matrix form as

$$\alpha_{t+1}^T = \frac{\alpha_t^T M(y_t)}{\alpha_t^T M(y_t)\mathbf{1}}, \qquad \beta_{t-1} = \frac{M(y_{t-1})\beta_t}{\pi^T M(y_{t-1})\beta_t},$$

where $\pi^T \mathbf{1} = 1$, $\alpha_{t+1}^T \mathbf{1} = 1$, and $\pi^T \beta_{t-1} = 1$. We will also make use of the shorthand notation

$$M(y_k^l) \triangleq \prod_{t=k}^{l} M(y_t).$$
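As a sanity check on (2.1) and (2.2), the following Python sketch compares the matrix-product form of $\Pr(Y_1^n = y_1^n)$ with a brute-force sum over all state paths. The helper names, the array layout `h[i, j, y]`, and the example parameters are assumptions chosen here for illustration.

```python
import itertools
import numpy as np

def obs_matrices(P, h):
    """M[y][i, j] = p_ij * h_ij(y), following (2.1)."""
    return [P * h[:, :, y] for y in range(h.shape[2])]

def seq_prob_matrix(pi, M, ys):
    """Pr(Y_1^n = y_1^n) = pi^T M(y_1) ... M(y_n) 1, following (2.2)."""
    row = pi.copy()
    for y in ys:
        row = row @ M[y]
    return row.sum()

def seq_prob_bruteforce(pi, P, h, ys):
    """Same probability, summing over every state path q_1..q_{n+1}."""
    total = 0.0
    for path in itertools.product(range(len(pi)), repeat=len(ys) + 1):
        p = pi[path[0]]
        for t, y in enumerate(ys):
            p *= P[path[t], path[t + 1]] * h[path[t], path[t + 1], y]
        total += p
    return total

# Hypothetical two-state example with binary observations.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])          # satisfies pi^T P = pi^T
h = np.empty((2, 2, 2))
h[:, :, 0] = [[0.95, 0.5], [0.5, 0.1]]
h[:, :, 1] = 1.0 - h[:, :, 0]
M = obs_matrices(P, h)
ys = [0, 1, 1, 0]
p_matrix = seq_prob_matrix(pi, M, ys)
p_brute = seq_prob_bruteforce(pi, P, h, ys)
```

The matrix form costs $O(n|\mathcal{Q}|^2)$ operations, while the path sum grows as $|\mathcal{Q}|^{n+1}$, so the brute-force version is only useful as a check on short sequences.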

2.3.2 Contraction Coefficients 

This section summarizes some standard results on the contractive properties of positive matrices and 
their connections to HMPs. More details can be found in [4TH [2"7l 125] . 

Definition 2.2. For any two vectors $u, v \in \mathcal{A}_0$, the Hilbert projective metric is

$$d(u, v) \triangleq \ln \frac{\max_i \left(u(i)/v(i)\right)}{\min_j \left(u(j)/v(j)\right)} = \ln \max_{i,j} \frac{u(i)v(j)}{v(i)u(j)} = -\ln \min_{i,j} \frac{u(i)v(j)}{v(i)u(j)}.$$

It is a metric on $\mathcal{A}_0 / \sim$, where $\sim$ is the equivalence relation with $u \sim v$ if $au = v$ for some $a \in \mathbb{R}_+$.
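A minimal Python sketch of Definition 2.2 follows; the function name `hilbert_metric` and the test vectors are illustrative assumptions. It makes the scale invariance of the metric easy to verify numerically.

```python
import numpy as np

def hilbert_metric(u, v):
    """d(u, v) = ln( max_i u(i)/v(i) / min_j u(j)/v(j) ) for positive vectors."""
    ratio = np.asarray(u, dtype=float) / np.asarray(v, dtype=float)
    return float(np.log(ratio.max() / ratio.min()))

u = np.array([0.2, 0.3, 0.5])
v = np.array([0.4, 0.4, 0.2])
w = np.array([0.1, 0.6, 0.3])
d_uv = hilbert_metric(u, v)
d_scaled = hilbert_metric(u, 2.5 * u)   # zero up to rounding: u ~ 2.5 u
```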

Proposition 2.3. For $u, v, w \in \mathcal{A}_0$ such that $w^T u = w^T v$, the Hilbert projective metric characterizes
the element-wise relative distance between two vectors in the sense that, for any $i \in \mathcal{Q}$,

$$d_M(u(i), v(i)) \triangleq \frac{|u(i) - v(i)|}{\max(u(i), v(i))} \le 1 - e^{-d(u,v)} \le d(u, v)$$

$$d_m(u(i), v(i)) \triangleq \frac{|u(i) - v(i)|}{\min(u(i), v(i))} \le e^{d(u,v)} - 1 \le 2d(u, v) \quad \text{if } d(u,v) \le 1,$$

where $d_M$ is a metric on $\mathbb{R}_+$ and $d_m$ is a semi-metric on $\mathbb{R}_+$ (i.e., the triangle inequality does not hold).



3 Since matrix multiplication is not commutative, we use the convention that $\prod_{t=1}^{n} M(y_t) = M(y_1) M(y_2) \cdots M(y_n)$.
Proof. If $u(k) > v(k)$, then we have

$$u(k)\, e^{-d(u,v)} = u(k) \min_j \frac{v(j)}{u(j)} \min_i \frac{u(i)}{v(i)} \le v(k) \min_i \frac{u(i)}{v(i)} \le v(k),$$

where $\min_i \frac{u(i)}{v(i)} \le 1$ because $w^T u = w^T v$. The stated results follow from $u(k) - v(k) \le e^{d(u,v)} v(k) - v(k)$,
$u(k) - v(k) \le u(k) - u(k) e^{-d(u,v)}$, and simple bounds on $e^x$. Both distances are clearly symmetric and
positive definite. The triangle inequality and other properties of $d_M$ are discussed in [43]. □

Lemma 2.4. For any vectors $u, v, w \in \mathcal{A}_0$ such that $w^T u = w^T v$, we have

$$\|u - v\|_1 \le \left(1 - e^{-d(u,v)}\right) \sum_{i\in\mathcal{Q}} \max(u(i), v(i)) \le \left(\|u\|_1 + \|v\|_1\right) d(u, v)$$

$$\|u - v\|_1 \le \left(e^{d(u,v)} - 1\right) \sum_{i\in\mathcal{Q}} \min(u(i), v(i)) \le \left(e^{d(u,v)} - 1\right) \min\left(\|u\|_1, \|v\|_1\right).$$

Proof. The expressions follow from direct calculation of $\|u - v\|_1$ using the bounds in Proposition 2.3. □
The following theorem of Birkhoff plays an important role in the remainder of this paper. 

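Theorem 2.5 ([40, Ch. 3]). Consider any non-negative matrix $M$ with at least one positive entry in
every row and column. Then, for all $u, v \in \mathcal{A}_0$, we have

$$d(Mu, Mv) \le \tau(M)\, d(u, v),$$

where $\tau(M) \triangleq \frac{1 - \phi(M)^{1/2}}{1 + \phi(M)^{1/2}} = \tau(M^T) \le 1$ is the Birkhoff contraction coefficient and

$$\phi(M) \triangleq \min_{i,j,k,l} \frac{[M]_{ik}[M]_{jl}}{[M]_{jk}[M]_{il}}. \qquad (2.3)$$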
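The contraction in Theorem 2.5 can be checked numerically. The sketch below assumes the standard definition of the Birkhoff coefficient, $\tau(M) = (1 - \phi(M)^{1/2})/(1 + \phi(M)^{1/2})$ with $\phi(M) = \min_{i,j,k,l} [M]_{ik}[M]_{jl} / ([M]_{jk}[M]_{il})$; the function names and the random test matrix are illustrative choices, not part of the paper.

```python
import numpy as np

def hilbert_metric(u, v):
    """Hilbert projective metric between positive vectors."""
    ratio = np.asarray(u, dtype=float) / np.asarray(v, dtype=float)
    return float(np.log(ratio.max() / ratio.min()))

def birkhoff_coefficient(M):
    """tau(M) for a non-negative matrix with no zero row or column."""
    M = np.asarray(M, dtype=float)
    if np.any(M <= 0):
        return 1.0  # phi(M) = 0 as soon as one entry vanishes
    # ratios[i, j, k, l] = M[i, k] M[j, l] / (M[j, k] M[i, l])
    num = M[:, None, :, None] * M[None, :, None, :]
    den = M[None, :, :, None] * M[:, None, None, :]
    phi = (num / den).min()
    return float((1 - np.sqrt(phi)) / (1 + np.sqrt(phi)))

rng = np.random.default_rng(0)
M = rng.uniform(0.1, 1.0, size=(4, 4))   # hypothetical positive matrix
tau = birkhoff_coefficient(M)
u = rng.uniform(0.1, 1.0, size=4)
v = rng.uniform(0.1, 1.0, size=4)
contracted = hilbert_metric(M @ u, M @ v)
bound = tau * hilbert_metric(u, v)
```

For a strictly positive matrix the coefficient is strictly below one, which is what drives the exponential forgetting used later in Lemma 2.9.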

The following results connect our HMP definition with Birkhoff's contraction coefficients. An FSMC
that is irreducible and aperiodic is called primitive. Since the underlying Markov chain is primitive, the
matrix $P$ must have at least one non-zero entry in each row and column.

Condition 2.6. For some $\delta > 0$, the joint probability of every valid transition and output is greater
than $\delta$. In other words, this means that $p_{ij} h_{ij}(y) > \delta > 0$ for all $(i,j) \in \mathcal{V}$ and $y \in \mathcal{Y}$.

Under Condition 2.6, the matrix $M(y)$ has exactly the same pattern of zero/non-zero entries as $P$
for all $y \in \mathcal{Y}$. Since $P$ is the transition matrix of an ergodic Markov chain, one finds that $M(y)$ must also
have at least one non-zero entry in each row and column for all $y \in \mathcal{Y}$. Therefore, $\tau(M(y)) \le 1$ for all
$y \in \mathcal{Y}$.

Definition 2.7. An HMP is said to be $(\epsilon, k)$-primitive if $\min_{i,j} [M(y_1^k)]_{ij} \ge k\epsilon$ for all $y_1^k \in \mathcal{Y}^k$. This
gives a uniform lower bound on the probability that a k-step transition of the HMP simultaneously
moves between any two states and generates any output sequence $y_1^k$. An HMP is said to be $\epsilon$-primitive
if there exists a $k < \infty$ such that it is $(\epsilon, k)$-primitive.

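Lemma 2.8. An HMP is $(\epsilon, k)$-primitive if it satisfies Condition 2.6 with $\delta \ge k^{1/k}\epsilon^{1/k}$ and $P^k$ is a
positive matrix. Moreover, this implies that $\pi(i) \ge k\epsilon$ (i.e., strictly positive) for all $i \in \mathcal{Q}$.

Proof. First, we note that $P^k$ positive implies there is a length-k path between any two states. Next,
we write

$$[M(y_1^k)]_{q_1, q_{k+1}} = \sum_{q_2^k \in \mathcal{Q}^{k-1}} \prod_{t=1}^{k} p_{q_t, q_{t+1}}\, h_{q_t, q_{t+1}}(y_t)$$
$$\ge \sum_{q_2^k \in \mathcal{Q}^{k-1}} \prod_{t=1}^{k} 1_{\mathcal{V}}\big((q_t, q_{t+1})\big)\, \delta$$
$$\ge \delta^k,$$

where the last step follows from the fact that there is a length-k path between any two states. Since
$\delta^k \ge k\epsilon$, we see that the HMP is $(\epsilon, k)$-primitive according to Definition 2.7. Note that, for any $u \in \mathcal{A}_0$,
we have

$$\sum_{i\in\mathcal{Q}} u(i)\, [M(y_1^k)]_{ij} \ge \left(\sum_{i\in\mathcal{Q}} u(i)\right) k\epsilon = \|u\|_1\, k\epsilon. \qquad (2.4)$$

Since $\pi^T P^k = \pi^T$ and each entry of $P^k$ is at least the corresponding entry of $M(y_1^k)$, applying (2.4)
with $u = \pi \in \mathcal{A}_0$ implies that $\pi(i) \ge k\epsilon$ for all $i \in \mathcal{Q}$. □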
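Lemma 2.9. For any $\epsilon$-primitive HMP, there exists a $k_0 < \infty$ such that, for all $y_1^k \in \mathcal{Y}^k$ and all $k \ge k_0$,

$$\tau\big(M(y_1^k)\big) \le e^{-2 k_0 \lfloor k/k_0 \rfloor \epsilon}.$$

Proof. From Definition 2.7, we can assume that the HMP is $(\epsilon, k_0)$-primitive. Using the bound (2.3), we
see that

$$\phi\big(M(y_1^{k_0})\big)^{1/2} \ge \frac{\min_{i,j} [M(y_1^{k_0})]_{ij}}{\max_{i,j} [M(y_1^{k_0})]_{ij}} \ge k_0 \epsilon$$

and

$$\tau\big(M(y_1^{k_0})\big) \le \frac{1 - k_0\epsilon}{1 + k_0\epsilon} \le e^{-2 k_0 \epsilon}.$$

Since we can break any length-k sequence into at least $\lfloor k/k_0 \rfloor$ length-$k_0$ pieces and $\tau(M(y)) \le 1$ for the
remaining pieces, we have $\tau\big(M(y_1^k)\big) \le \left(e^{-2 k_0 \epsilon}\right)^{\lfloor k/k_0 \rfloor}$. □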



2.3.3 Lyapunov Exponents 



Consider any stationary stochastic process, $\{Y_t\}_{t\in\mathbb{Z}}$, equipped with a function, $M(y)$, that maps each
$y \in \mathcal{Y}$ to a matrix. Now, consider the limit

$$\lim_{n\to\infty} \frac{1}{n} \log \left\| u^T M(Y_1) M(Y_2) \cdots M(Y_n) \right\|,$$

where $u$ is any non-zero vector and $\|\cdot\|$ is any vector norm. Oseledec's multiplicative ergodic theorem says
that this limit is deterministic for almost all realizations [34]. An earlier ergodic theorem of Furstenberg
and Kesten [14] gives a nice proof that

$$\lim_{n\to\infty} \frac{1}{n} \log \left\| M(Y_1) M(Y_2) \cdots M(Y_n) \right\| \overset{\text{a.s.}}{=} \gamma_1,$$

where $\|\cdot\|$ is any matrix norm and $\gamma_1$ is known as the top Lyapunov exponent. The connection with
entropy rate is given by the fact that, for an HMP, choosing $M(y)$ according to (2.1) implies that
$H(\mathcal{Y}) = -\gamma_1$ [37].



2.4 Stationary Measures 

The forward and backward state probability vectors play a very important role in the analysis of HMPs.
These vectors, $\alpha_t, \beta_t \in \mathcal{A}_0$, are themselves random variables which often have well-defined stationary
distributions. To illustrate the mixing properties, we exploit the stationarity of the HMP and focus on
time zero by defining the random variables

$$U_n(i) \triangleq \Pr(Q_0 = i \mid Y_{-n}^{-1})$$
$$V_n(i) \triangleq \frac{1}{\pi(i)} \Pr(Q_0 = i \mid Y_0^{n-1}).$$

It is worth noting that $U_n(i)$ is a deterministic function of $y_{-n}^{-1}$ and $V_n(i)$ is a deterministic function of
$y_0^{n-1}$. The following sufficient condition characterizes some of the HMPs that have stationary distributions.



Definition 2.10. An HMP is called almost-surely mixing if there exists a $C < \infty$, $\gamma < 1$, and $k < \infty$
such that

$$\Pr\big(d(U_m, U_n) > C\gamma^n\big) \le C\gamma^n$$
$$\Pr\big(d(V_m, V_n) > C\gamma^n\big) \le C\gamma^n$$

for all $m \ge n + k \ge k + 1$. This implies that the forward and backward recursions both forget their
initial conditions at an exponential rate that is uniform over all but an exponentially small set of received
sequences.

Definition 2.11. An HMP is called sample-path mixing if there exists a $C < \infty$, $\gamma < 1$, and $k < \infty$
such that

$$d(U_m, U_n) \le C\gamma^n$$
$$d(V_m, V_n) \le C\gamma^n$$

for all $m \ge n + k \ge k + 1$ and all received sequences $y_{-m}^{m-1} \in \mathcal{Y}^{2m}$. This implies that the forward and
backward recursions both forget their initial conditions at an exponential rate that is uniform over all
received sequences. It is easy to see that sample-path mixing implies almost-surely mixing.

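Lemma 2.12. An $(\epsilon, k)$-primitive HMP is sample-path mixing with $\gamma = e^{-2\epsilon}$ and $C = -2\ln(k\epsilon)\gamma^{-k}$.

Proof. For each $y_{-n}^{-1}$, the realization of $U_n(i)$ is given by

$$U_n(i) = \Pr(Q_0 = i \mid Y_{-n}^{-1} = y_{-n}^{-1}) = \frac{\left[\pi^T M(y_{-n}^{-1})\right]_i}{\pi^T M(y_{-n}^{-1})\mathbf{1}}.$$

First, we let $w^T = \pi^T M(y_{-m}^{-n-1})$ and note that (2.4) implies that

$$d(w^T, \pi^T) = \ln \max_{i,j} \frac{w(i)\pi(j)}{\pi(i)w(j)} \le \ln \max_{i,j} \frac{1}{\pi(i)w(j)} \le \ln\left((k\epsilon)^{-2}\right)$$

when $m \ge n + k$. Next, we use Theorem 2.5 and Lemma 2.9 to see that

$$d(U_m, U_n) = d\big(w^T M(y_{-n}^{-1}), \pi^T M(y_{-n}^{-1})\big) \le \tau\big(M(y_{-n}^{-1})\big)\, d(w^T, \pi^T)$$
$$\le \tau\big(M(y_{-n}^{-1})\big) \ln\left((k\epsilon)^{-2}\right)$$
$$\le -2\ln(k\epsilon)\, e^{-2\lfloor n/k \rfloor k\epsilon}.$$

This gives an exponential rate of $\gamma = e^{-2\epsilon}$, and $C = -2\ln(k\epsilon)\gamma^{-k}$ is chosen to handle the floor function
and constant. For the backward recursion, the proof is identical except that the constant $C$ is smaller
by a factor of 2 because, with $w = M(y_n^{m-1})\mathbf{1}$,

$$d\big(M(y_n^{m-1})\mathbf{1}, \mathbf{1}\big) = \ln \max_{i,j} \frac{w(i)}{w(j)} \le \ln \frac{1}{k\epsilon}.$$

□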



Lemma 2.13. A (0, k)-primitive HMP is almost-surely mixing for some 7 < 1 and C < 00 if 

«ij [M(Y 1 k )] t 



m&xE 

qeQ 



1 '-3 



< 00. 



In particular, this can be applied to HMPs with continuous observations. 

Proof. This lemma follows, with slight modifications, from the arguments in [27]. Its proof is beyond the
scope of this work. □
Proposition 2.14. The joint process $\{Q_t, \alpha_t\}_{t\in\mathbb{Z}}$ forms a Markov chain. If the HMP is almost-surely
mixing, then the marginal distribution converges weakly to a unique stationary measure $\mu_q(A)$.

Proof. One can see this is a Markov chain by considering the following method of generating the sequence.
At each step, we first choose $q_{t+1}$ according to $p_{q_t, q_{t+1}}$, then choose $y_t$ according to $h_{q_t, q_{t+1}}(y_t)$, and finally
compute $\alpha_{t+1}(\cdot)$ from $\alpha_t(\cdot)$ and $y_t$. In most cases, this Markov chain will not have a finite state space
because $\alpha_t(\cdot)$ may take uncountably many values. Of course, this process depends on the initialization
of the first $\alpha_t$, but this dependence decays with time if the HMP is almost-surely mixing. For simplicity,
one may assume the initialization $\alpha_1 = \pi$ is used.

To show that $\mu_q^{(t)}(A) \triangleq \Pr(Q_0 = q, U_t \in A)$ converges weakly to the probability measure $\mu_q(A)$ for
all Borel subsets $A \subseteq \mathcal{P}_0$, we observe that $\mu_q^{(t)}(A)$ is a Cauchy sequence with respect to the Prohorov
metric. This is sufficient because the Prohorov metric metrizes weak convergence on separable spaces
and $\mathcal{P}_0$ is separable [5, p. 72]. Let $d(u, A) = \inf_{v \in A} d(u, v)$ and $A^\delta = \{u \in \mathcal{P}_0 \mid d(u, A) < \delta\}$ so that the
Prohorov metric is given by

$$d_P(\mu, \mu') = \inf\left\{\delta \in \mathbb{R}_+ \mid \mu'(A) \le \mu(A^\delta) + \delta \;\; \forall \text{ Borel } A \subseteq \mathcal{P}_0\right\}.$$

Since the HMP is almost-surely mixing, we can use the fact that $\Pr\big(d(U_{t+k}, U_t) > C\gamma^t\big) \le C\gamma^t$ for all
$k > 0$, to see that

$$\mu_q^{(t+k)}(A) = \Pr(Q_0 = q, U_{t+k} \in A) \le \Pr\big(Q_0 = q, U_t \in A^{C\gamma^t}\big) + C\gamma^t = \mu_q^{(t)}\big(A^{C\gamma^t}\big) + C\gamma^t.$$

This implies that $d_P\big(\mu_q^{(t)}, \mu_q^{(t+k)}\big) \le C\gamma^t$ for all $k > 0$. Therefore, $\mu_q^{(t)}(A)$ is a Cauchy sequence with
respect to $d_P$ and it converges weakly to some probability measure. Therefore, we can define $\mu_q(A)$ to
be the weak limit of $\mu_q^{(t)}(A)$. □

Definition 2.15. The (forward) Furstenberg measure is the unique stationary measure (when it exists)
of the joint process $\{Q_t, \alpha_t\}_{t\in\mathbb{Z}}$ and is given by the weak limit

$$\Pr(Q_t = q, \alpha_t \in A) \xrightarrow{w} \mu_q(A),$$

for any Borel measurable set $A \subseteq \mathcal{P}_0$. While this does not depend on the initialization of $\alpha_t$, one may
assume the initialization $\alpha_1 = \pi$ for simplicity.

Remark 2.16. This name is chosen because the measure first appears in the work of Furstenberg and
Kifer [15] and is closely related to the work that was started by Furstenberg and Kesten [14].

Consistency of the a posteriori probability (APP) 

The following lemma will be used to make connections between the measures defined in this section.

Lemma 2.17. Let $X, Y$ be discrete r.v.s and let the APP function be $E_y(x) \triangleq \Pr(X = x \mid Y = y)$. Then,
$E_Y(x) = \Pr(X = x \mid Y)$ is a random function (due to $Y$) and we have

$$\Pr\big(X = x, E_Y(\cdot) = e(\cdot)\big) = \Pr\big(E_Y(\cdot) = e(\cdot)\big)\, e(x).$$

Proof. Applying the chain rule and the definition of $E_Y(\cdot)$ gives

$$\Pr\big(X = x, E_Y(\cdot) = e(\cdot)\big) = \Pr\big(E_Y(\cdot) = e(\cdot)\big) \Pr\big(X = x \mid E_Y(\cdot) = e(\cdot)\big)$$
$$= \Pr\big(E_Y(\cdot) = e(\cdot)\big) \Pr(X = x \mid Y)$$
$$= \Pr\big(E_Y(\cdot) = e(\cdot)\big)\, e(x),$$

where the second step follows from the fact that $E_Y(\cdot)$ is a sufficient statistic for $X$ (e.g., $X$ can be
faithfully generated from $Y$ using the Markov chain $Y \to E_Y(\cdot) \to X$). □
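Lemma 2.17 is easy to verify numerically on a small joint distribution. The Python sketch below groups the values of $Y$ by their APP vector and compares both sides of the identity; the function name `app_consistency_gap` and the example joint table are assumptions made for illustration only.

```python
from collections import defaultdict
import numpy as np

def app_consistency_gap(joint):
    """joint[x, y] = Pr(X = x, Y = y). Returns max |LHS - RHS| in Lemma 2.17."""
    p_y = joint.sum(axis=0)
    lhs = defaultdict(float)   # (x, e) -> Pr(X = x, E_Y = e)
    p_e = defaultdict(float)   # e -> Pr(E_Y = e)
    for y in range(joint.shape[1]):
        if p_y[y] == 0:
            continue
        # APP vector e(.) = Pr(X = . | Y = y), rounded so equal vectors share a key
        e = tuple(np.round(joint[:, y] / p_y[y], 12))
        p_e[e] += p_y[y]
        for x in range(joint.shape[0]):
            lhs[(x, e)] += joint[x, y]
    return max(abs(lhs[(x, e)] - p_e[e] * e[x]) for (x, e) in lhs)

# Hypothetical joint law: the first two values of Y induce the same APP vector.
joint = np.array([[0.10, 0.20, 0.15],
                  [0.10, 0.20, 0.25]])
gap = app_consistency_gap(joint)
```

The first two columns of `joint` are proportional, so they collapse to a single APP vector; this is the case where the sufficiency argument in the proof does real work.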

Proposition 2.18. The process $\{\alpha_t\}_{t\in\mathbb{Z}}$ forms a Markov chain. If the HMP is almost-surely mixing,
then it converges weakly to a unique stationary measure $\mu(A)$.

Proof. One can see that $\{\alpha_t\}_{t\in\mathbb{Z}}$ is Markov by considering another method of generating the sequence.
At each step, we first choose $q_t$ according to $\alpha_t(\cdot)$, then choose $q_{t+1}$ according to $p_{q_t, q_{t+1}}$, then choose $y_t$
according to $h_{q_t, q_{t+1}}(y_t)$, and finally compute $\alpha_{t+1}(\cdot)$ from $\alpha_t(\cdot)$ and $y_t$. Of course, this process depends
on the initialization of the first $\alpha_t$, but this dependence decays with time if the HMP is almost-surely
mixing. For simplicity, one may assume the initialization $\alpha_1 = \pi$ is used.

Comparing this to Proposition 2.14, one sees that we are now using $\alpha_t(\cdot)$ as a proxy distribution for
$Q_t$. This works because Lemma 2.17 shows that

$$\Pr(\alpha_t \in A) \inf_{\alpha \in A} \alpha(q) \le \Pr(Q_t = q, \alpha_t \in A) \le \Pr(\alpha_t \in A) \sup_{\alpha \in A} \alpha(q),$$

for any open set $A \subseteq \mathcal{P}_0$. By making $A$ arbitrarily small, one can force the LHS and RHS to be arbitrarily
close. The proof of weak convergence to a unique stationary distribution as $t \to \infty$ is essentially identical
to the corresponding proof for Proposition 2.14. □

Definition 2.19. The (forward) Blackwell measure is the unique stationary measure (when it exists)
of the process $\{\alpha_t\}_{t\in\mathbb{Z}}$ and is given by the weak limit

$$\Pr(\alpha_t \in A) \xrightarrow{w} \mu(A),$$

for any Borel measurable set $A \subseteq \mathcal{P}_0$. From the definition of $\mu_q$, we see also that $\mu(A) = \sum_{q\in\mathcal{Q}} \mu_q(A)$.

Remark 2.20. This name is chosen because this measure first appears in the work of Blackwell [8] and
is now commonly called the Blackwell measure [18].

Lemma 2.21. The Radon-Nikodym derivative $\frac{d\mu_q}{d\mu}$ of the (forward) Furstenberg measure $\mu_q$ with respect
to the (forward) Blackwell measure $\mu$ exists and satisfies

$$\frac{d\mu_q}{d\mu}(\alpha) = \Pr(Q_t = q \mid \alpha_t = \alpha)$$

$\mu$-almost everywhere. This implies that

$$\mu_q(d\alpha) = \alpha(q)\, \mu(d\alpha).$$

Proof. First, we note that $\mu(A) = \sum_{q\in\mathcal{Q}} \mu_q(A)$ implies that $\mu_q$ is absolutely continuous w.r.t. $\mu$. Therefore,
the Radon-Nikodym derivative $\frac{d\mu_q}{d\mu}$ exists. Since

$$\frac{\mu_q(A)}{\mu(A)} = \frac{\Pr(Q_t = q, \alpha_t \in A)}{\Pr(\alpha_t \in A)} = \Pr(Q_t = q \mid \alpha_t \in A),$$

the first result can be seen by choosing $A$ to be arbitrarily small. The second result holds because $\alpha_t(\cdot)$
is the APP estimate of $Q_t$ given $Y_1^{t-1}$ and this (e.g., see Lemma 2.17) implies that

$$\Pr(Q_t = q \mid \alpha_t = \alpha) = \alpha(q).$$

□

Theorem 2.22 ([8]). In terms of the Blackwell measure, the entropy rate (in nats) of an HMP is

$$H(\mathcal{Y}) = -\int_{\mathcal{P}_0} \mu(d\alpha) \sum_{y\in\mathcal{Y}} \alpha^T M(y)\mathbf{1}\, \ln\big(\alpha^T M(y)\mathbf{1}\big). \qquad (2.5)$$

Proof. Consider the sequence $H(Y_t \mid Y_1^{t-1})$ for any stationary process. This sequence is non-negative and
non-increasing and therefore must have a limit. Moreover, the entropy rate

$$H(\mathcal{Y}) \triangleq \lim_{n\to\infty} \frac{1}{n} H(Y_1, \ldots, Y_n) = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} H(Y_t \mid Y_1^{t-1})$$

is the Cesàro mean of this sequence and must have the same limit. Next, we note that

$$\alpha_t^T M(y)\mathbf{1} = \sum_{i,j\in\mathcal{Q}} \alpha_t(i)\, p_{ij}\, h_{ij}(y) = \Pr(Y_t = y \mid Y_1^{t-1}).$$

Therefore, (2.5) is simply the expression for $\lim_{t\to\infty} H(Y_t \mid Y_1^{t-1})$.

□



Once again, this time in reverse... 

One can also reverse time for these Markov processes so that $\{Q_t, \beta_t\}_{t\in\mathbb{Z}}$ forms a backward Markov chain.
Starting from $q_t$ and working backwards, one first chooses $q_{t-1}$ according to $\Pr(Q_{t-1} = q_{t-1} \mid Q_t = q_t) =
p_{q_{t-1}, q_t}\, \pi(q_{t-1}) / \pi(q_t)$. Then, one generates $y_{t-1}$ according to $h_{q_{t-1}, q_t}(y_{t-1})$ and computes $\beta_{t-1}$ from $\beta_t$ and
$y_{t-1}$.

This process also depends on the initialization of the first $\beta_t$, but this dependence decays with time
if the HMP is almost-surely mixing. For simplicity, one may assume the initialization $\beta_1 = \mathbf{1}$ is used.
If the HMP is almost-surely mixing, then the joint distribution of $Q_t, \beta_t$ converges weakly to a unique
stationary distribution as $t \to -\infty$; the proof is very similar to the corresponding part of the proof of
Proposition 2.14. This allows us to define the stationary distribution of the backwards state probability
vector.

As with the forward process, we can reduce the state space to $\{\beta_t\}_{t\in\mathbb{Z}}$. At each step, one chooses
$q_t$ according to $\Pr(Q_t = q_t) = \beta_t(q_t)\pi(q_t)$, then continues as described above to generate $q_{t-1}$, $y_{t-1}$,
and $\beta_{t-1}$. Let $B \subseteq \{u \in \mathcal{A}_0 \mid \pi^T u = 1\}$ be any open measurable set. Then, using $\beta_t(q)\pi(q)$ as a proxy
distribution for $Q_t$ works because Lemma 2.17 shows that

$$\Pr(\beta_t \in B)\, \pi(q) \inf_{\beta \in B} \beta(q) \le \Pr(Q_t = q, \beta_t \in B) \le \Pr(\beta_t \in B)\, \pi(q) \sup_{\beta \in B} \beta(q),$$

and choosing $B$ arbitrarily small allows the LHS and RHS to be made arbitrarily close. This process also
depends on the initialization of $\beta_t$, but if the HMP is almost-surely mixing, then it converges weakly to
a unique stationary distribution.

Definition 2.23. The backward Furstenberg measure is the unique stationary measure (when it exists)
of the backwards process $\{Q_t, \beta_t\}_{t\in\mathbb{Z}}$ and is given by the weak limit

$$\Pr(Q_t = q, \beta_t \in B) \xrightarrow{w} \nu_q(B).$$

Definition 2.24. The backward Blackwell measure is the unique stationary measure (when it exists)
of the backwards process $\{\beta_t\}_{t\in\mathbb{Z}}$ and is given by the weak limit

$$\Pr(\beta_t \in B) \xrightarrow{w} \nu(B).$$

Lemma 2.25. The Radon-Nikodym derivative $\frac{d\nu_q}{d\nu}$ of the backwards Furstenberg measure $\nu_q$ with respect
to the backwards Blackwell measure $\nu$ exists and satisfies

$$\frac{d\nu_q}{d\nu}(\beta) = \Pr(Q_t = q \mid \beta_t = \beta)$$

$\nu$-almost everywhere. This implies that

$$\nu_q(d\beta) = \pi(q)\beta(q)\, \nu(d\beta).$$
Proof. First, we note that $\nu(B) = \sum_{q\in\mathcal{Q}} \nu_q(B)$ implies that $\nu_q$ is absolutely continuous w.r.t. $\nu$. Therefore,
the Radon-Nikodym derivative $\frac{d\nu_q}{d\nu}$ exists. Since

$$\frac{\nu_q(B)}{\nu(B)} = \frac{\Pr(Q_t = q, \beta_t \in B)}{\Pr(\beta_t \in B)} = \Pr(Q_t = q \mid \beta_t \in B),$$

the first result can be seen by choosing $B$ to be arbitrarily small. The second result holds because $\pi(q)\beta_t(q)$
is the APP estimate of $Q_t$ given $Y_t^\infty$ and this (e.g., see Lemma 2.17) implies that

$$\Pr(Q_t = q \mid \beta_t = \beta) = \pi(q)\beta(q).$$

□



3 Taking the Derivative 
3.1 The Derivative Shortcut 

In this section, we introduce a shortcut often used in the statistical physics community. It was introduced
to the author by Measson et al. in [28, 29]. It has also been applied to the problem under consideration
by Zuk et al. in P||T2].

Let $D \subset \mathbb{R}$ be a compact set and $g_n : D^n \to \mathbb{R}$ be a sequence of functions which essentially depend
on a single parameter $\theta \in D$ in $n$ different ways. Abusing notation, we also let $g_n : D \to \mathbb{R}$ be the same
function where this dependency is combined so that $g_n(\theta) = g_n(\theta, \ldots, \theta)$. The total derivative of $g_n$ can
be written as

$$\frac{d}{d\theta} g_n(\theta) = \sum_{i=1}^{n} \frac{\partial}{\partial\theta_i} g_n(\theta_1, \ldots, \theta_n) \bigg|_{(\theta_1, \ldots, \theta_n) = (\theta, \ldots, \theta)}.$$

This motivates us to define

$$g'_n(\theta_1, \ldots, \theta_n) \triangleq \sum_{i=1}^{n} \frac{\partial}{\partial\theta_i} g_n(\theta_1, \ldots, \theta_n).$$

Since the abuse of notation is habit forming, we will also define $g'_n(\theta) = g'_n(\theta, \ldots, \theta)$.

The focus of this paper is the limit of these functions as $n$ goes to infinity, so a few technical details
are required. If $g_n(\theta) \to f(\theta)$ uniformly over $\theta \in D$ and $\lim_{n\to\infty} g'_n(\theta)$ converges uniformly over $\theta \in D$,
then it follows that $f'(\theta) = \lim_{n\to\infty} g'_n(\theta)$. One might assume that it is necessary to prove uniform
convergence for both of these sequences, but the following standard problem in analysis shows that it
suffices to consider only the sequence of derivatives.

Lemma 3.1. Let $g_n : D \to \mathbb{R}$ be a sequence of functions that are continuously differentiable on a compact
set $D \subset \mathbb{R}$. If $g_n(\theta_0)$ converges for some $\theta_0 \in D$ and $g'_n(\theta)$ converges uniformly on $D$, then the limits

$$f(\theta) \triangleq \lim_{n\to\infty} g_n(\theta)$$
$$f'(\theta) \triangleq \lim_{n\to\infty} g'_n(\theta)$$

both exist and are uniformly continuous on $D$.

Proof. First, we note that each $g'_n(\theta)$ is uniformly continuous because $D$ is compact. Since $g'_n(\theta)$ converges
uniformly, we find that $f'(\theta)$ exists and is uniformly continuous (and hence bounded) on $D$.
Interchanging the limit and integral, based on uniform convergence, implies that

$$\lim_{n\to\infty} \left[g_n(\theta) - g_n(\theta_0)\right] = \lim_{n\to\infty} \int_{\theta_0}^{\theta} g'_n(x)\,dx = \int_{\theta_0}^{\theta} \lim_{n\to\infty} g'_n(x)\,dx = \int_{\theta_0}^{\theta} f'(x)\,dx = f(\theta) - f(\theta_0).$$

This implies that $g_n(\theta)$ converges to $f(\theta)$. Finally, we note that $f(\theta)$ is uniformly continuous on $D$
because $f'(\theta)$ exists and is bounded on $D$. □
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3.2 Warmup Example: The Derivative of the Log Spectral Radius 

The spectral radius of a real matrix M is defined to be 



p(M) 4 lim ||M n || 1/n 

n— too 



lim -log||M r ' 



for any matrix norm. Likewise, the log spectral radius (LSR) of a real matrix M is given by 

lnp(M) " ' 

for any matrix norm. Moreover, if M has non-negative entries, then 

\np(M) = lim - log (u T M n v) 

n—>oo n 

for any vectors u, v G Aq. 

Let $M_\theta$ be a mapping from a compact set $D \subset \mathbb{R}$ to the set of non-negative real matrices. Assume
further that $M_\theta$ has a unique real eigenvalue $\lambda_1$ of maximum modulus (i.e., the 2nd largest eigenvalue
$\lambda_2$ satisfies $|\lambda_2/\lambda_1| < \gamma < 1$) for all $\theta \in D$. Using the shorthand notation $M = M_{\theta^*}$ for $\theta^* \in D$, we let
$a, b \in \Delta_{\mathcal{Q}}$ be left/right (column) eigenvectors of $M$ with eigenvalue $\rho(M)$; they satisfy $a^T M = \rho(M) a^T$
and $M b = \rho(M) b$. In this case, it is known that the derivative of the LSR is given by

$$\frac{d}{d\theta} \ln \rho(M_\theta) \Big|_{\theta=\theta^*} = \frac{a^T M' b}{a^T M_{\theta^*} b},$$

where $M' = M'_{\theta^*}$ is the element-wise derivative defined by $[M'_\theta]_{ij} = \frac{d}{d\theta} [M_\theta]_{ij}$. Of course, one must
assume that $M'$ exists and satisfies $\|M'\| < \infty$.

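One can prove this by applying the derivative shortcut to $f(\theta) = \log \rho(M_\theta)$ using

$$g_n(\theta_1, \ldots, \theta_n) = \frac{1}{n} \ln \left( u^T \Big( \prod_{i=1}^n M_{\theta_i} \Big) v \right)$$

for any vectors $u, v \in \Delta_{\mathcal{Q}}$. Based on Lemma 3.1, we focus on $g_n'(\theta)$ by writing

$$g_n'(\theta^*) = \sum_{i=1}^n \frac{\partial}{\partial \theta_i} \frac{1}{n} \ln \left( u^T \Big( \prod_{k=1}^n M_{\theta_k} \Big) v \right) \Bigg|_{\theta_1^n = (\theta^*, \ldots, \theta^*)} = \frac{1}{n} \sum_{i=1}^n \frac{u^T M^{i-1} M' M^{n-i} v}{u^T M^n v},$$

where we have used that

$$\frac{d}{d\theta} x^T M_\theta y = \sum_{k,l} x_k \frac{d}{d\theta} [M_\theta]_{kl} \, y_l = x^T M'_\theta y.$$

Since $M_\theta$ satisfies $|\lambda_2/\lambda_1| < \gamma$ for all $\theta \in D$, it follows that

$$\frac{u^T M^{i-1}}{\|u^T M^{i-1}\|} = a^T + O\left(\gamma^{i-1}\right), \qquad \frac{M^{n-i} v}{\|M^{n-i} v\|} = b + O\left(\gamma^{n-i}\right).$$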
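Treating the boundary and interior terms, in the sum, separately gives

$$g_n'(\theta^*) = \frac{1}{n} \left[ O\left( (\ln n)^2 \right) + \sum_{i=\lfloor (\ln n)^2 \rfloor + 1}^{n - \lfloor (\ln n)^2 \rfloor} \frac{a^T M' b + O\left(\gamma^{(\ln n)^2}\right) \|M'\|}{a^T M b + O\left(\gamma^{(\ln n)^2}\right) \|M\|} \right],$$

where the $O((\ln n)^2)$ term collects the $2\lfloor (\ln n)^2 \rfloor$ boundary terms, each of which is bounded by a constant
that depends only on $\|M'\|$, $a$, $b$, $u$, and $v$.

Therefore, $g_n(\theta)$ and $g_n'(\theta)$ converge uniformly for all $\theta \in D$ and we find that

$$f'(\theta^*) = \frac{a^T M' b}{a^T M b}.$$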
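
The LSR derivative formula above can be sanity-checked numerically. The following is a minimal sketch, not part of the original derivation: the matrix family `M_theta` is a hypothetical positive 2x2 example chosen only for illustration, and the eigenvectors are obtained by plain power iteration. It compares $a^T M' b / (a^T M b)$ with a central finite difference of $\ln \rho(M_\theta)$.

```python
import math

def vec_mat(u, M):
    """Row vector times matrix: u^T M."""
    return [sum(u[i] * M[i][j] for i in range(len(u))) for j in range(len(M[0]))]

def mat_vec(M, v):
    """Matrix times column vector: M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def perron(M, iters=200):
    """Power iteration for rho(M) and left/right eigenvectors a, b (entries sum to 1)."""
    n = len(M)
    a = [1.0 / n] * n
    b = [1.0 / n] * n
    for _ in range(iters):
        a = vec_mat(a, M)
        s = sum(a)
        a = [x / s for x in a]
        b = mat_vec(M, b)
        s = sum(b)
        b = [x / s for x in b]
    rho = sum(mat_vec(M, b))  # sum(b) = 1, so sum(M b) = rho
    return rho, a, b

def M_theta(theta):
    # Hypothetical positive matrix family, used only for this illustration.
    return [[1.0 + theta, 2.0], [0.5, 1.0 + theta ** 2]]

def dM_theta(theta):
    # Element-wise derivative of M_theta with respect to theta.
    return [[1.0, 0.0], [0.0, 2.0 * theta]]

def lsr(theta):
    """Log spectral radius ln rho(M_theta)."""
    return math.log(perron(M_theta(theta))[0])

def lsr_derivative(theta):
    """Evaluate a^T M' b / (a^T M b) at the given theta."""
    M, dM = M_theta(theta), dM_theta(theta)
    _, a, b = perron(M)
    return dot(a, mat_vec(dM, b)) / dot(a, mat_vec(M, b))

theta_star = 0.3
h = 1e-5
finite_diff = (lsr(theta_star + h) - lsr(theta_star - h)) / (2 * h)
formula = lsr_derivative(theta_star)
print(finite_diff, formula)
```

The ratio is invariant to the scaling of $a$ and $b$, so the normalization used inside `perron` does not affect the result.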



3.3 The Derivative of the Entropy Rate 

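Let $M_\theta(y)$ be the transition-observation probability matrix of an HMP, which depends on the real parameter
$\theta$, and let $\pi$ be the stationary distribution of the underlying Markov chain. To compute the derivative
of the entropy rate, we define

$$g_n(\theta_1, \ldots, \theta_n) = -\frac{1}{n} \sum_{y_1^n \in \mathcal{Y}^n} \pi^T \Big( \prod_{i=1}^n M_{\theta_i}(y_i) \Big) \mathbf{1} \cdot \ln \left( \pi^T \Big( \prod_{i=1}^n M_{\theta_i}(y_i) \Big) \mathbf{1} \right)$$

and write $g_n(\theta^*)$ for its value at $(\theta_1, \ldots, \theta_n) = (\theta^*, \ldots, \theta^*)$.

This implies that $f(\theta) = \lim_{n\to\infty} g_n(\theta) = H(\mathcal{Y}; \theta)$ in nats.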

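Theorem 3.2. Let $D \subset \mathbb{R}$ be a compact set and assume that $\frac{d}{d\theta}\pi = 0$ and $M'_\theta(y) = \frac{d}{d\theta} M_\theta(y)$ exists for
all $\theta \in D$. Then, if the HMP is well-defined and $\epsilon$-primitive for all $\theta \in D$, $f'(\theta^*) = \frac{d}{d\theta} H(\mathcal{Y}; \theta) \big|_{\theta=\theta^*}$
equals

$$-\int_{\Delta_{\mathcal{Q}}} \mu(d\alpha) \int_{\Delta_{\mathcal{Q}}} \nu(d\beta) \sum_{y \in \mathcal{Y}} \alpha^T M'_{\theta^*}(y) \beta \, \ln \left( \alpha^T M_{\theta^*}(y) \beta \right), \tag{3.1}$$

where $\mu$ and $\nu$ are the forward/backward Blackwell measures of the HMP at $\theta = \theta^*$. Moreover, $f(\theta)$ and
$f'(\theta)$ are uniformly continuous on $D$.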

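Proof. The following shorthand is used throughout: $\pi_t(q) = \Pr(Q_t = q)$, $M(y) = M_{\theta^*}(y)$, $M'(y) =
M'_{\theta^*}(y)$, and $M(y_j^k) = \prod_{t=j}^k M(y_t)$. For the HMP to be well-defined, the transition matrices must
satisfy $\sum_{y \in \mathcal{Y}} M_\theta(y) \mathbf{1} = \mathbf{1}$ and $\sum_{y \in \mathcal{Y}} M'_\theta(y) \mathbf{1} = 0$ for all $\theta \in D$. It follows that, for any $u \in \mathcal{P}_0$, one has

$$\sum_{y_1^n \in \mathcal{Y}^n} u^T \Big( \prod_{t=1}^n M_{\theta_t}(y_t) \Big) \mathbf{1} = 1 \quad \text{and} \quad \frac{\partial}{\partial \theta_j} \sum_{y_1^n \in \mathcal{Y}^n} u^T \Big( \prod_{t=1}^n M_{\theta_t}(y_t) \Big) \mathbf{1} = 0. \tag{3.2}$$

Based on Lemma 3.1, we note that the entropy rate exists for all $\theta \in D$ and focus on the derivative

$$g_n'(\theta^*) \overset{(a)}{=} -\frac{1}{n} \sum_{j=1}^n \frac{\partial}{\partial \theta_j} \sum_{y_1^n \in \mathcal{Y}^n} \pi^T \Big( \prod_{t=1}^n M_{\theta_t}(y_t) \Big) \mathbf{1} \left[ \ln \frac{\pi^T \big( \prod_{t=1}^n M_{\theta_t}(y_t) \big) \mathbf{1}}{C_j} + \ln C_j \right]$$

$$\overset{(b)}{=} -\frac{1}{n} \sum_{j=1}^n \frac{\partial}{\partial \theta_j} \sum_{y_1^n \in \mathcal{Y}^n} \pi^T \Big( \prod_{t=1}^n M_{\theta_t}(y_t) \Big) \mathbf{1} \cdot \ln \frac{\pi^T \big( \prod_{t=1}^n M_{\theta_t}(y_t) \big) \mathbf{1}}{C_j}$$

$$\overset{(c)}{=} -\frac{1}{n} \sum_{j=1}^n \frac{\partial}{\partial \theta_j} \sum_{y_1^n \in \mathcal{Y}^n} \pi^T M(y_1^{j-1}) M_{\theta_j}(y_j) M(y_{j+1}^n) \mathbf{1} \cdot \ln \frac{\pi^T M(y_1^{j-1}) M_{\theta_j}(y_j) M(y_{j+1}^n) \mathbf{1}}{\left( \pi^T M(y_1^{j-1}) \mathbf{1} \right) \left( \pi^T M(y_{j+1}^n) \mathbf{1} \right)},$$

with every derivative evaluated at $(\theta_1, \ldots, \theta_n) = (\theta^*, \ldots, \theta^*)$,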
where (a) holds for arbitrary positive values $C_1, \ldots, C_n$, (b) follows because (3.2) implies the $\ln C_j$ gives
no contribution if $\frac{\partial}{\partial \theta_j} C_j = 0$, and (c) follows from choosing

$$C_j = \Pr\left( Y_1^{j-1} = y_1^{j-1} \right) \Pr\left( Y_{j+1}^n = y_{j+1}^n \right) = \left( \pi_1^T M(y_1^{j-1}) \mathbf{1} \right) \left( \pi_{j+1}^T M(y_{j+1}^n) \mathbf{1} \right).$$

One subtlety is that $\pi_{j+1}^T = \pi_j^T \sum_{y \in \mathcal{Y}} M_{\theta_j}(y)$ is affected by $\theta_j$. So, small changes in $\theta_j$ cause small
changes in $\pi_{j+1}$ and we must add the condition $\frac{d}{d\theta}\pi = 0$ to guarantee that $\frac{\partial}{\partial \theta_j} C_j = 0$. After adding this
condition, we may safely assume that $\pi_j = \pi$ for $j = 1, \ldots, n$. See Remark 3.3 for more details.
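For Borel measurable sets $A \subseteq \Delta_{\mathcal{Q}}$ and $B \subseteq \{ u \in \Delta_{\mathcal{Q}} \mid \pi^T u = 1 \}$, the sets

$$U_j(A) = \left\{ y_1^{j-1} \in \mathcal{Y}^{j-1} \,\middle|\, \frac{M(y_1^{j-1})^T \pi}{\pi^T M(y_1^{j-1}) \mathbf{1}} \in A \right\}, \qquad V_j(B) = \left\{ y_j^n \in \mathcal{Y}^{n-j+1} \,\middle|\, \frac{M(y_j^n) \mathbf{1}}{\pi^T M(y_j^n) \mathbf{1}} \in B \right\}$$

will be used to define the measures $\mu^{(j)}(A) = \Pr\left( Y_1^{j-1} \in U_j(A) \right)$ and $\nu^{(j)}(B) = \Pr\left( Y_j^n \in V_j(B) \right)$ for
the forward/backward state probabilities. In this case, $\mu^{(j)}(\cdot)$ and $\nu^{(j)}(\cdot)$ are probability measures on $\Delta_{\mathcal{Q}}$ for
the random variables $\alpha_j, \beta_j$ (i.e., the normalized vectors appearing in $U_j$ and $V_j$). Using these measures,
we find that $g_n'(\theta^*)$ is given by

$$g_n'(\theta^*) = -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n \in \mathcal{Y}^n} \Pr(y_1^{j-1}) \Pr(y_{j+1}^n) \frac{\partial}{\partial \theta_j} \left[ \alpha_j^T M_{\theta_j}(y_j) \beta_{j+1} \ln \left( \alpha_j^T M_{\theta_j}(y_j) \beta_{j+1} \right) \right]$$

$$= -\frac{1}{n} \sum_{j=1}^n \int_{\Delta_{\mathcal{Q}}} \mu^{(j)}(d\alpha) \int_{\Delta_{\mathcal{Q}}} \nu^{(j+1)}(d\beta) \sum_{y_j \in \mathcal{Y}} \frac{\partial}{\partial \theta_j} \left[ \alpha^T M_{\theta_j}(y_j) \beta \ln \left( \alpha^T M_{\theta_j}(y_j) \beta \right) \right]$$

$$= -\frac{1}{n} \sum_{j=1}^n \int_{\Delta_{\mathcal{Q}}} \mu^{(j)}(d\alpha) \int_{\Delta_{\mathcal{Q}}} \nu^{(j+1)}(d\beta) \sum_{y_j \in \mathcal{Y}} \left[ \alpha^T M'(y_j) \beta \ln \left( \alpha^T M(y_j) \beta \right) + \alpha^T M'(y_j) \beta \right].$$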

All that is left is to compute the sum. If the HMP is almost-surely mixing, then the earlier results on the
forward/backward Blackwell measures show that the measures converge weakly (i.e., $\mu^{(j)} \to \mu$ and $\nu^{(j)} \to \nu$).
Moreover, Lemma A.2 in Appendix A.1 shows that the convergence rate is exponential. Therefore, most of the terms
in the sum have essentially the same value. Like the LSR, we neglect terms within $(\ln n)^2$ of the block edge because
their contribution is negligible as $n \to \infty$. The exponential convergence of the stationary measures also
shows that the interior terms become equal at the super-polynomial rate $\gamma^{(\ln n)^2} = n^{\ln n \cdot \ln \gamma}$. Therefore,
$g_n(\theta)$ and $g_n'(\theta)$ converge uniformly for all $\theta \in D$ and each interior term

$$\int_{\Delta_{\mathcal{Q}}} \mu^{(j)}(d\alpha) \int_{\Delta_{\mathcal{Q}}} \nu^{(j+1)}(d\beta) \sum_{y_j \in \mathcal{Y}} \left[ \alpha^T M'(y_j) \beta \ln \left( \alpha^T M(y_j) \beta \right) + \alpha^T M'(y_j) \beta \right]$$

converges to

$$\int_{\Delta_{\mathcal{Q}}} \mu(d\alpha) \int_{\Delta_{\mathcal{Q}}} \nu(d\beta) \sum_{y \in \mathcal{Y}} \left[ \alpha^T M'(y) \beta \ln \left( \alpha^T M(y) \beta \right) + \alpha^T M'(y) \beta \right]. \tag{3.3}$$

Finally, the last term in (3.3) is shown to be zero in Lemma 3.4. □



Remark 3.3. The necessity of the condition $\frac{d}{d\theta}\pi = 0$ in Theorem 3.2 can be a bit subtle. This is
because the $\pi$-term in many equations (e.g., $\pi^T M(y_{j+1}^n) \mathbf{1}$) actually represents the state distribution at a
particular time (e.g., time $j+1$). The indices are dropped after the first few steps because the underlying
Markov chain is stationary and the state distribution is independent of time. For example, the proof
liberally uses the assumption that

$$\Pr\left( Y_{j+1}^n = y_{j+1}^n \right) = \sum_{q, q' \in \mathcal{Q}} \Pr(Q_{j+1} = q) \Pr\left( Q_{n+1} = q', Y_{j+1}^n = y_{j+1}^n \mid Q_{j+1} = q \right) = \pi^T M(y_{j+1}^n) \mathbf{1},$$

where the last step clearly requires that $\Pr(Q_{j+1} = q) = \pi(q)$. Moreover, this is not simply a problem
with the proof. The author has applied the formula from Theorem 3.2 to a Markov chain (where the
true entropy-rate derivative is well-known) and shown that the two expressions become equal only if
$\frac{d}{d\theta}\pi = 0$.

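Lemma 3.4. The following properties of the forward/backward Blackwell measures will be useful:

$$\int_{\Delta_{\mathcal{Q}}} \mu(d\alpha) \, \alpha = \pi$$

$$\int_{\Delta_{\mathcal{Q}}} \nu(d\beta) \, \beta = \mathbf{1}$$

$$\int_{\Delta_{\mathcal{Q}}} \mu(d\alpha) \sum_{y \in \mathcal{Y}} \alpha^T M(y) \beta = 1$$

$$\int_{\Delta_{\mathcal{Q}}} \nu(d\beta) \sum_{y \in \mathcal{Y}} \alpha^T M(y) \beta = 1$$

$$\int_{\Delta_{\mathcal{Q}}} \mu(d\alpha) \int_{\Delta_{\mathcal{Q}}} \nu(d\beta) \sum_{y \in \mathcal{Y}} \alpha^T M'(y) \beta = 0.$$

Proof. The proof is deferred to the appendix. □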



3.4 Behavior of the Entropy Rate in the High Noise Regime 

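Suppose the domain of $\theta$ includes a "high noise" point $\theta^*$ where the channel output provides no information
about the channel state. In this case, the forward/backward Blackwell measures become singletons
on $\pi, \mathbf{1}$ and the entropy rate $H(\mathcal{Y}; \theta)$ converges to the single-letter entropy $H(Y; \theta)$ as $\theta \to \theta^*$. In the
high-noise regime, one can also evaluate the derivative from Theorem 3.2 in closed form and extend the
formula to the 2nd derivative. In this section, we compare the expansions of $H(\mathcal{Y}; \theta)$ and $H(Y; \theta)$.
First, we consider the single-letter entropy

$$H(Y; \theta) = -\sum_{y \in \mathcal{Y}} \Pr(Y_t = y) \log \left( \Pr(Y_t = y) \right) = -\sum_{y \in \mathcal{Y}} \pi^T M_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right),$$

where $\pi$ is the stationary distribution of the underlying Markov chain as a function of $\theta$.

Lemma 3.5. Under the assumption that $\frac{d}{d\theta}\pi = 0$ for all $\theta \in D$, the 1st derivative w.r.t. $\theta$ of the
single-letter entropy is given by

$$\frac{d}{d\theta} H(Y; \theta) = -\sum_{y \in \mathcal{Y}} \pi^T M'_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right).$$

Under the same assumption, the 2nd derivative w.r.t. $\theta$ is given by

$$\frac{d^2}{d\theta^2} H(Y; \theta) = -\sum_{y \in \mathcal{Y}} \pi^T M''_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right) - \sum_{y \in \mathcal{Y}} \frac{\left( \pi^T M'_\theta(y) \mathbf{1} \right)^2}{\pi^T M_\theta(y) \mathbf{1}}. \tag{3.4}$$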
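Proof. In particular, the 1st derivative is given by

$$\frac{d}{d\theta} H(Y; \theta) = -\frac{d}{d\theta} \sum_{y \in \mathcal{Y}} \pi^T M_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right) = -\sum_{y \in \mathcal{Y}} \pi^T M'_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right)$$

because $\frac{d}{d\theta}\pi = 0$ and $\sum_{y \in \mathcal{Y}} \pi^T M_\theta(y) \mathbf{1} = 1$ for all $\theta$. Since $\frac{d}{d\theta}\pi = 0$ for all $\theta \in D$, the 2nd derivative is
given by

$$\frac{d^2}{d\theta^2} H(Y; \theta) = -\frac{d}{d\theta} \sum_{y \in \mathcal{Y}} \pi^T M'_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right) = -\sum_{y \in \mathcal{Y}} \pi^T M''_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right) - \sum_{y \in \mathcal{Y}} \frac{\left( \pi^T M'_\theta(y) \mathbf{1} \right)^2}{\pi^T M_\theta(y) \mathbf{1}}.$$

□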
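
As a quick numerical illustration of (3.4), the sketch below compares the formula with a central second difference of the single-letter entropy. It assumes the binary Markov source with BSC noise of Section 3.5, for which $\pi^T M_\theta(0) \mathbf{1} = \pi(0)(1-\epsilon) + \pi(1)\epsilon$ with $\epsilon = \frac{1}{2} - \theta$ and $M''_\theta(y) = 0$; the function names and parameter values are illustrative choices, not part of the original text.

```python
import math

def stationary(p00, p11):
    """Stationary distribution of the two-state chain."""
    z = 2.0 - p00 - p11
    return [(1.0 - p11) / z, (1.0 - p00) / z]

def output_probs(p00, p11, theta):
    """pi^T M_theta(y) 1 for y = 0, 1 with crossover eps = 1/2 - theta."""
    pi = stationary(p00, p11)
    eps = 0.5 - theta
    q0 = pi[0] * (1.0 - eps) + pi[1] * eps
    return [q0, 1.0 - q0]

def entropy(p00, p11, theta):
    """Single-letter entropy H(Y; theta) in nats."""
    return -sum(q * math.log(q) for q in output_probs(p00, p11, theta))

def second_derivative_formula(p00, p11, theta):
    """Right-hand side of (3.4); M'' = 0 here, so only the second sum survives."""
    pi = stationary(p00, p11)
    d = pi[0] - pi[1]  # pi^T M'(0) 1 = -(pi^T M'(1) 1)
    q = output_probs(p00, p11, theta)
    return -(d * d) / q[0] - (d * d) / q[1]

# Illustrative parameter values (assumptions, not from the paper).
p00, p11, theta, h = 0.9, 0.6, 0.1, 1e-4
finite_diff = (entropy(p00, p11, theta + h) - 2.0 * entropy(p00, p11, theta)
               + entropy(p00, p11, theta - h)) / h ** 2
formula = second_derivative_formula(p00, p11, theta)
print(finite_diff, formula)
```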

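Now, we consider closed-form evaluation of Theorem 3.2. Since the first derivative is often zero at
$\theta = \theta^*$, we are fortunate that a new formula for the 2nd derivative can also be evaluated in closed form.

Theorem 3.6. If there is a function $s(y)$, a $\theta^* \in D$, and a matrix $P$ such that $\lim_{\theta \to \theta^*} M_\theta(y) = s(y) P$
for all $y \in \mathcal{Y}$, then

$$\frac{d}{d\theta} H(\mathcal{Y}; \theta) \Big|_{\theta=\theta^*} = -\sum_{y \in \mathcal{Y}} \pi^T M'(y) \mathbf{1} \ln \left( s(y) \right)$$

and

$$\frac{d^2}{d\theta^2} H(\mathcal{Y}; \theta) \Big|_{\theta=\theta^*} = -\sum_{y \in \mathcal{Y}} \pi^T M''(y) \mathbf{1} \ln \left( s(y) \right) - \sum_{y \in \mathcal{Y}} \frac{\left( \pi^T M'(y) \mathbf{1} \right)^2}{s(y)}. \tag{3.5}$$

Proof. The proof is deferred to the appendix. □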



3.5 HMP Example: A Binary Markov-1 Source with BSC Noise 

Consider the HMP defined by a binary Markov-1 source observed through a BSC($\epsilon$). The two-state
Markov process is defined by $\Pr(Q_{t+1} = j \mid Q_t = i) = p_{ij}$ with stationary distribution $\Pr(Q_t = i) = \pi(i)$,
and $\pi(0) = 1 - \pi(1) = \frac{1 - p_{11}}{2 - p_{00} - p_{11}}$. The output of the HMP is simply the observation of the state through a
BSC or more specifically

$$\Pr(Y = y \mid Q = i) = \begin{cases} 1 - \epsilon & \text{if } y = i \\ \epsilon & \text{otherwise.} \end{cases}$$

The entropy rate of this process was considered earlier using a range of techniques [31], [32], [18], [44]. Now,
we will consider the entropy rate of this process as $\epsilon \to \frac{1}{2}$ (i.e., in the high-noise regime). This special
case was also treated earlier and very similar results were obtained using different methods in [30], [32], [33].

Since we are interested in the high-noise regime, we start by analyzing the system using the upper
bound $H(\mathcal{Y}) \le H(Y)$. This gives

$$H(Y) = -\sum_{y \in \mathcal{Y}} \Pr(Y = y) \ln \left( \Pr(Y = y) \right),$$

where

$$\Pr(Y = 0) = \pi(0) p_{00} (1 - \epsilon) + \pi(0) p_{01} \epsilon + \pi(1) p_{10} (1 - \epsilon) + \pi(1) p_{11} \epsilon$$

$$\Pr(Y = 1) = \pi(0) p_{00} \epsilon + \pi(0) p_{01} (1 - \epsilon) + \pi(1) p_{10} \epsilon + \pi(1) p_{11} (1 - \epsilon).$$
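Using the Taylor expansion of $H(Y; \theta)$ in $\theta = \frac{1}{2} - \epsilon$ around $\theta = 0$, we find that

$$H(\mathcal{Y}) \le H(Y; \theta) = \ln 2 - \frac{4 (p_{00} - p_{11})^2}{(2 - p_{00} - p_{11})^2} \frac{\theta^2}{2} + O\left(\theta^4\right). \tag{3.6}$$

To calculate this expansion exactly for $H(\mathcal{Y})$, we apply Theorem 3.6. The conditions of the Theorem
are satisfied because

$$M_\theta(y) = \begin{cases} \begin{bmatrix} p_{00}(1-\epsilon) & p_{01}\epsilon \\ p_{10}(1-\epsilon) & p_{11}\epsilon \end{bmatrix} & \text{if } y = 0 \\[2ex] \begin{bmatrix} p_{00}\epsilon & p_{01}(1-\epsilon) \\ p_{10}\epsilon & p_{11}(1-\epsilon) \end{bmatrix} & \text{if } y = 1 \end{cases} \;=\; \begin{cases} \dfrac{1}{2} \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix} + \theta \begin{bmatrix} p_{00} & -p_{01} \\ p_{10} & -p_{11} \end{bmatrix} & \text{if } y = 0 \\[2ex] \dfrac{1}{2} \begin{bmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{bmatrix} - \theta \begin{bmatrix} p_{00} & -p_{01} \\ p_{10} & -p_{11} \end{bmatrix} & \text{if } y = 1 \end{cases}$$

implies $M_\theta(0) = M_\theta(1)$ at $\theta = 0$ (i.e., $\epsilon = \frac{1}{2}$). Computing (3.5), which is simplified by the symmetry of
$M'_\theta(y)$ and the fact that $M''_\theta(y)$ is the zero matrix, gives

$$\frac{d^2}{d\theta^2} H(\mathcal{Y}; \theta) \Big|_{\theta=0} = -2 \left( \frac{\begin{bmatrix} 1 - p_{11} & 1 - p_{00} \end{bmatrix}}{2 - p_{00} - p_{11}} \begin{bmatrix} p_{00} & -p_{01} \\ p_{10} & -p_{11} \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right)^2 - 2 \left( \frac{\begin{bmatrix} 1 - p_{11} & 1 - p_{00} \end{bmatrix}}{2 - p_{00} - p_{11}} \begin{bmatrix} -p_{00} & p_{01} \\ -p_{10} & p_{11} \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right)^2$$

$$= -\frac{4 (p_{00} - p_{11})^2}{(2 - p_{00} - p_{11})^2}, \tag{3.7}$$

where the factor 2 in front of each square is $1/s(y)$ with $s(y) = \frac{1}{2}$.

Since $H(\mathcal{Y}; 0) = \ln 2$, this implies that the upper bound is tight with respect to the first non-zero term
in the high-noise expansion.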
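
The tightness claim can be probed numerically with a brute-force computation. The sketch below is an illustration under stated assumptions rather than the paper's method: it approximates the entropy rate by the conditional entropy $H(Y_n \mid Y_1^{n-1})$ for a short block (here $n = 6$), which lies between the entropy rate and the single-letter entropy, and compares $(\ln 2 - H)/\theta^2$ with the coefficient $2 (p_{00} - p_{11})^2 / (2 - p_{00} - p_{11})^2$ implied by (3.6) and (3.7). Function names and parameter values are hypothetical.

```python
import itertools
import math

def transition_observation_matrices(p00, p11, eps):
    """M(y)[i][j] = p_ij * Pr(observe y | next state j) for a BSC(eps)."""
    P = [[p00, 1.0 - p00], [1.0 - p11, p11]]
    return {y: [[P[i][j] * ((1.0 - eps) if y == j else eps) for j in (0, 1)]
                for i in (0, 1)]
            for y in (0, 1)}

def block_entropy(p00, p11, eps, n):
    """H(Y_1^n) in nats, by enumerating all 2^n output sequences."""
    M = transition_observation_matrices(p00, p11, eps)
    z = 2.0 - p00 - p11
    pi = [(1.0 - p11) / z, (1.0 - p00) / z]
    H = 0.0
    for ys in itertools.product((0, 1), repeat=n):
        v = pi
        for y in ys:
            # Row-vector update v <- v M(y).
            v = [v[0] * M[y][0][j] + v[1] * M[y][1][j] for j in (0, 1)]
        p = v[0] + v[1]  # Pr(Y_1^n = ys) = pi^T M(y_1) ... M(y_n) 1
        H -= p * math.log(p)
    return H

def conditional_entropy(p00, p11, eps, n):
    """H(Y_n | Y_1^{n-1}) = H(Y_1^n) - H(Y_1^{n-1})."""
    return block_entropy(p00, p11, eps, n) - block_entropy(p00, p11, eps, n - 1)

# Illustrative parameter values (assumptions, not from the paper).
p00, p11, theta, n = 0.9, 0.6, 0.01, 6
quotient = (math.log(2.0) - conditional_entropy(p00, p11, 0.5 - theta, n)) / theta ** 2
predicted = 2.0 * (p00 - p11) ** 2 / (2.0 - p00 - p11) ** 2
print(quotient, predicted)
```

For small $\theta$ the two printed values should agree closely, since the discrepancy is governed by the $O(\theta^4)$ terms of the expansion.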

3.6 Example 2: A Conditionally Gaussian HMP 

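Consider an HMP where the output distribution, conditioned on the state of the underlying Markov chain,
is Gaussian. Suppose that the Gaussian associated with the transition from state $i$ to state $j$ has mean
$\theta \cdot m_{ij}$ and variance 1, then this implies that $h_{ij}(y) = \frac{1}{\sqrt{2\pi}} e^{-(y - \theta m_{ij})^2/2}$. Since the HMP loses state
dependence as $\theta \to 0$, we first consider the derivatives w.r.t. $\theta$ of the single-letter entropy

$$H(Y; \theta) = -\int_{-\infty}^{\infty} \pi^T M_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right) dy.$$

In this case, the stationary distribution does not depend on $\theta$ so translating Lemma 3.5 to the
continuous alphabet case gives

$$\frac{d}{d\theta} H(Y; \theta) \Big|_{\theta=0} = -\lim_{\theta \to 0} \int_{-\infty}^{\infty} \pi^T M'_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right) dy$$

$$= -\lim_{\theta \to 0} \int_{-\infty}^{\infty} \sum_{i,j \in \mathcal{Q}} \frac{\pi(i) p_{ij} m_{ij} (y - \theta m_{ij})}{\sqrt{2\pi}} e^{-(y - \theta m_{ij})^2/2} \log \Bigg( \sum_{k,l \in \mathcal{Q}} \frac{\pi(k) p_{kl}}{\sqrt{2\pi}} e^{-(y - \theta m_{kl})^2/2} \Bigg) dy$$

$$= -\int_{-\infty}^{\infty} \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij} \, \frac{y}{\sqrt{2\pi}} e^{-y^2/2} \log \left( \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \right) dy = 0$$

because the odd moments of a zero-mean Gaussian are zero. Likewise, the formula for the 2nd derivative
(3.4) can be translated into

$$\frac{d^2}{d\theta^2} H(Y; \theta) = -\int_{-\infty}^{\infty} \pi^T M''_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right) dy - \int_{-\infty}^{\infty} \frac{\left( \pi^T M'_\theta(y) \mathbf{1} \right)^2}{\pi^T M_\theta(y) \mathbf{1}} dy.$$

The second term $T_2$ of the expression for $\frac{d^2}{d\theta^2} H(Y; \theta) \big|_{\theta=0}$ is given by

$$T_2 = -\lim_{\theta \to 0} \int_{-\infty}^{\infty} \frac{\left( \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij} (y - \theta m_{ij}) \frac{1}{\sqrt{2\pi}} e^{-(y - \theta m_{ij})^2/2} \right)^2}{\sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} \frac{1}{\sqrt{2\pi}} e^{-(y - \theta m_{ij})^2/2}} dy = -\Bigg( \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij} \Bigg)^2 \int_{-\infty}^{\infty} \frac{y^2}{\sqrt{2\pi}} e^{-y^2/2} dy = -\Bigg( \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij} \Bigg)^2.$$

Using the fact that

$$\left[ M''_\theta(y) \right]_{ij} \Big|_{\theta=0} = p_{ij} m_{ij}^2 \left( y^2 - 1 \right) \frac{1}{\sqrt{2\pi}} e^{-y^2/2},$$

we can write the first term $T_1$ of the expression for $\frac{d^2}{d\theta^2} H(Y; \theta) \big|_{\theta=0}$ as

$$T_1 = -\lim_{\theta \to 0} \int_{-\infty}^{\infty} \pi^T M''_\theta(y) \mathbf{1} \log \left( \pi^T M_\theta(y) \mathbf{1} \right) dy$$

$$= -\int_{-\infty}^{\infty} \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij}^2 \left( y^2 - 1 \right) \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \log \left( \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \right) dy$$

$$= \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij}^2 \cdot \frac{1}{2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \left( y^4 - y^2 \right) dy \overset{(a)}{=} \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij}^2,$$

where (a) follows from the fact that the 4th moment of a standard Gaussian is 3.

Comparing Lemma 3.5 with Theorem 3.6 shows that the first two terms in the expansion of $H(Y; \theta)$
match the first two terms in the expansion of $H(\mathcal{Y}; \theta)$ at $\theta = 0$. Therefore, we have

$$\frac{d^2}{d\theta^2} H(\mathcal{Y}; \theta) \Big|_{\theta=0} = \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij}^2 - \Bigg( \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij} \Bigg)^2. \tag{3.8}$$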
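
The right-hand side of (3.8) is the variance of the transition means under the edge weights $\pi(i) p_{ij}$. The sketch below illustrates this for the single-letter entropy only, under assumptions that are not in the original text: a hypothetical three-component Gaussian mixture with weights `weights` and means `means`, entropy computed by the trapezoid rule, and the second derivative at $\theta = 0$ approximated by $2 (H(\theta) - H(0)) / \theta^2$ for a small $\theta$.

```python
import math

def mixture_entropy(weights, means, theta, lo=-12.0, hi=12.0, steps=4800):
    """Differential entropy (nats) of sum_k w_k N(theta * m_k, 1), trapezoid rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for n in range(steps + 1):
        y = lo + n * dx
        f = sum(w * math.exp(-0.5 * (y - theta * m) ** 2)
                for w, m in zip(weights, means)) / math.sqrt(2.0 * math.pi)
        term = -f * math.log(f) if f > 0.0 else 0.0
        total += term * (0.5 if n in (0, steps) else 1.0)
    return total * dx

# Hypothetical edge weights pi(i) p_ij (summing to 1) and transition means m_ij.
weights = [0.5, 0.3, 0.2]
means = [-1.0, 0.5, 2.0]

mean = sum(w * m for w, m in zip(weights, means))
variance = sum(w * m * m for w, m in zip(weights, means)) - mean ** 2

theta = 0.05
h0 = mixture_entropy(weights, means, 0.0)
second_derivative = 2.0 * (mixture_entropy(weights, means, theta) - h0) / theta ** 2
print(second_derivative, variance)
```

At $\theta = 0$ the mixture is a standard Gaussian, so `h0` should match $\frac{1}{2} \ln(2 \pi e)$, which gives an independent check of the integration.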



4 Application: High-Noise Capacity Expansions for FSCs 
4.1 The Derivative of Capacity for an FSC 

Now, we will use the previous result to compute the derivative of the capacity. The mutual information
$I(X; Y)$ between the r.v.s $X$ and $Y$ is defined by $I(X; Y) = H(Y) - H(Y|X)$, where the conditional
entropy is defined by $H(Y|X) = H(X, Y) - H(X)$. Since the mutual information depends on the
input distribution, the capacity is defined to be the supremum of the mutual information over all input
distributions. Therefore, some care must be taken when expressing the derivative of the capacity in
terms of the derivative of the mutual information.

Consider a family of FSCs whose entropy rate is differentiable with respect to some parameter $\theta$.
Let the input distribution be Markov with memory $m$ (e.g., defined by the vector $P$ containing $|\mathcal{X}|^{m+1}$
values) and the optimal input distribution be $P(\theta)$. In this case, we let the mutual information rate be
$I(\theta, P)$ and the Markov-$m$ capacity be $C(\theta) = I(\theta, P(\theta))$.



Lemma 4.1. The derivative of the Markov-$m$ capacity is given by

$$\frac{d}{d\theta} C(\theta) = \frac{d}{d\theta} I(\theta, P(\theta)) = I'_\theta(\theta, P(\theta)), \tag{4.1}$$

where $I'_\theta(\theta, P(\theta))$ is the derivative (w.r.t. $\theta$) of the mutual information rate evaluated at the
capacity-achieving input distribution for $\theta$.

Proof. Expanding the derivative of $C(\theta)$ in terms of $I'_\theta(\theta, P)$ and the gradient vector $I'_P(\theta, P)$ (w.r.t.
input distribution), gives

$$dI(\theta, P) = I'_\theta(\theta, P) \, d\theta + I'_P(\theta, P) \cdot dP.$$

The optimality of $P(\theta)$ implies $I'_P(\theta, P(\theta)) \cdot dP = 0$ for any $dP$ satisfying $dP \cdot \mathbf{1} = 0$ (i.e., the sum of
$P(\theta)$ is a constant). So, the derivative of the capacity is the derivative of the mutual information rate
and we have (4.1). □



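Corollary 4.2. If there is a "high noise" point $\theta^* \in D$ where the Markov-$m$ capacity satisfies $C(\theta^*) = 0$
and $C'(\theta^*) = 0$, then

$$\frac{d^2}{d\theta^2} C(\theta) \Big|_{\theta=\theta^*} = I''_\theta(\theta^*, P(\theta^*)),$$

where $I''_\theta(\theta, P(\theta))$ is the 2nd derivative (w.r.t. $\theta$) of the mutual information rate evaluated at the
capacity-achieving input distribution for $\theta$.

Proof. First, we write the 2nd derivative as

$$\lim_{\theta \to \theta^*} \frac{d}{d\theta} I'_\theta(\theta, P(\theta)) = I''_\theta(\theta^*, P(\theta^*)) + \lim_{\theta \to \theta^*} \left[ \frac{\partial}{\partial P} I'_\theta(\theta, P) \right]_{P = P(\theta^*)} \cdot P'(\theta).$$

Now, recall that $I'_\theta(\theta^*, P(\theta^*)) = 0$ and suppose that the 2nd term is positive. In this case, a small
change in $P$ in the direction $P'(\theta^*)$ must give an $I'_\theta(\theta^*, P) > 0$. But, this contradicts the fact that

$$0 = C'(\theta^*) \ge \max_P I'_\theta(\theta^*, P).$$

Therefore, the 2nd term must be zero. □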

If the domain of $\theta$ includes a "high noise" point $\theta^*$ where the channel output provides no information
about the channel state, then Theorem 3.6 shows that the first two $\theta$-derivatives of the entropy rate
$H(\mathcal{Y}; \theta)$ can be calculated at $\theta = \theta^*$. In fact, one also sees that they match the first two $\theta$-derivatives
of the single-letter entropy $H(Y; \theta)$ at $\theta = \theta^*$. Using Lemma 4.1 and Corollary 4.2, we see that these
derivatives also equal the derivatives of the Markov-$m$ capacity in this case. But this equality holds for
all $m$, so we can take a limit to see that it must hold also for the true capacity [10]. Even without this,
however, we can use the fact that $H(\mathcal{Y}; \theta) \le H(Y; \theta)$ to upper bound the maximum entropy rate over
all input distributions.
4.2 FSC Example: A BSC with an RLL Constraint

Consider the FSC defined by the BSC($\epsilon$) with a (0,1) run-length limited (RLL) constraint [25]. This is a standard
binary symmetric channel with a constraint that the input cannot have two 1s in a row (e.g., this requires
a two-state input process). The two-state input process is defined by $\Pr(X_{t+1} = j \mid X_t = i) = p_{ij}$ with
$p_{11} = 0$, $\Pr(X_t = i) = \pi(i)$, and $\pi(0) = 1 - \pi(1) = \frac{1}{2 - p_{00}}$.

The mutual information rate between the input and output satisfies

$$I(\mathcal{X}; \mathcal{Y}) = H(\mathcal{Y}) - H(\mathcal{Y}|\mathcal{X}) \le H(Y_1) - h(\epsilon),$$

where $h(\epsilon) = -\epsilon \ln \epsilon - (1 - \epsilon) \ln(1 - \epsilon)$ is the binary entropy function in nats. Now, we can let $\theta = \frac{1}{2} - \epsilon$
and combine the entropy-rate expansion from (3.6) with the fact that $h(\frac{1}{2} - \theta) = \ln 2 - 2\theta^2 + O(\theta^4)$.
The resulting high-noise expansion for the upper bound is

$$I(\mathcal{X}; \mathcal{Y}) \le \frac{8 (1 - p_{00})}{(2 - p_{00})^2} \theta^2 + O\left(\theta^4\right).$$

Notice that the leading coefficient achieves a unique maximum value of 2 at $p_{00} = 0$. Since this upper
bound only depends on the single-letter probabilities, it cannot be increased by extending the memory
of the input process.

To see that this rate is achievable, we apply Theorem 3.6 to our system. Taking the result from (3.7),
we find that

$$I(\mathcal{X}; \mathcal{Y}) = H(\mathcal{Y}) - H(\mathcal{Y}|\mathcal{X}) = \left( \ln 2 - \frac{2 p_{00}^2}{(2 - p_{00})^2} \theta^2 + o\left(\theta^2\right) \right) - \left( \ln 2 - 2\theta^2 + O\left(\theta^4\right) \right) = \frac{8 (1 - p_{00})}{(2 - p_{00})^2} \theta^2 + o\left(\theta^2\right).$$

So the leading term of the actual expansion matches the upper bound.

From a coding perspective, this result implies that we should choose our Shannon random
codebook to be sequences with mostly alternating 01 patterns and an occasional 00 pattern (i.e., one that occurs
with probability $p_{00} \to 0$). It is also worth mentioning that this constraint costs nothing when the noise
is large because the slope of the expansion matches the slope of the unconstrained BSC as $p_{00} \to 0$.
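
The behavior of the leading coefficient $8(1 - p_{00})/(2 - p_{00})^2$ is easy to tabulate. The short sketch below (grid resolution and names are arbitrary illustrative choices) evaluates it on a grid over $p_{00} \in [0, 1]$ and locates the maximizer.

```python
def leading_coefficient(p00):
    """Coefficient of theta^2 in the high-noise expansion: 8(1 - p00) / (2 - p00)^2."""
    return 8.0 * (1.0 - p00) / (2.0 - p00) ** 2

# Evaluate on a uniform grid over [0, 1]; the resolution is an arbitrary choice.
grid = [k / 1000.0 for k in range(1001)]
values = [leading_coefficient(p) for p in grid]
best = max(range(len(grid)), key=lambda k: values[k])
print(grid[best], values[best])
```

On this grid the coefficient decreases monotonically from its value at $p_{00} = 0$, which matches the slope $2\theta^2$ of the unconstrained BSC expansion $\ln 2 - h(\frac{1}{2} - \theta)$.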



4.3 FSC Example: Intersymbol-Interference Channels in AWGN 



Consider a family of finite-memory ISI channels parametrized by $\theta$. Let the time-$t$ output $Y_t$ be a
Gaussian whose mean is given by $\theta$ times a deterministic function of the current input and the previous
$k$ inputs. Under these conditions, the output process is a conditionally Gaussian HMP, with state
$Q_t = (X_{t-1}, \ldots, X_{t-k})$, as defined in Section 3.6. Moreover, the conditional entropy rate $H(\mathcal{Y}|\mathcal{X})$ only
depends on the noise variance, which can be taken to be 1 without loss of generality. Therefore,
$\theta$-derivatives of the mutual information rate, $I(\mathcal{X}; \mathcal{Y}) = H(\mathcal{Y}) - H(\mathcal{Y}|\mathcal{X})$, depend only on $\theta$-derivatives of
the entropy rate $H(\mathcal{Y})$.

Let the mean of the output process induced by a state transition $Q_t = i$ to $Q_{t+1} = j$ be $m_{ij}$. One
can explore the high-noise regime by keeping the noise variance fixed to 1 and letting $\theta \to 0$. In this
case, one can combine (3.8) and Corollary 4.2 to see that

$$C(\theta) = \frac{\theta^2}{2} \left[ \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij}^2 - \Bigg( \sum_{i,j \in \mathcal{Q}} \pi(i) p_{ij} m_{ij} \Bigg)^2 \right] + o\left(\theta^2\right).$$

The first term in this expansion can be optimized over the input distribution $p_{ij}$, but there are a
few caveats. Let $e_{ij} = \pi(i) p_{ij}$ be the edge occupancy probabilities that satisfy $\sum_{i,j} e_{ij} = 1$, then
stationarity of the underlying Markov chain implies that $\sum_j (e_{ij} - e_{ji}) = 0$. One also finds that not
all state transitions are valid, but setting = if V gives the following convex (see footnote 4) optimization
problem with linear constraints:

$$\text{maximize} \quad \sum_{i,j \in \mathcal{Q}} e_{ij} m_{ij}^2 - \Bigg( \sum_{i,j \in \mathcal{Q}} e_{ij} m_{ij} \Bigg)^2$$

$$\text{subject to} \quad \sum_{i,j \in \mathcal{Q}} e_{ij} = 1$$

$$\sum_{j \in \mathcal{Q}} (e_{ij} - e_{ji}) = 0 \quad \forall i.$$

A similar result is given in [41] for linear ISI channels with balanced inputs (i.e., a zero-mean input).
In this case, the $\sum e_{ij} m_{ij}$ term is zero and the optimization problem is reduced to finding the maximum
mean-weight cycle in a directed graph with edge weights $m_{ij}^2$. The formula above generalizes the previous
result to non-linear ISI channels and eliminates the zero-mean input requirement.



5 Connection to the Formula of Vontobel et al. 

The results of this paper are closely related to an observation by Vontobel et al. [42] that the first
part of the generalized Blahut-Arimoto algorithm for FSCs actually computes the derivative of the mutual
information. Their result is somewhat different because it considers derivatives with respect to the
edge occupancy probabilities $\pi(i) p_{ij}$ rather than the observation probabilities. Their approach is also
dissimilar because the answer is given exactly for finite blocks rather than focusing on the asymptotically
long blocks and the forward/backward stationary measures. Moreover, the result in this paper does not
apply to changes in the HMP which change the stationary distribution $\pi$ of the underlying Markov chain, while
the derivative result in [42] focuses exclusively on changes in the edge occupancy probabilities.

Ideally, one would have a unified treatment of the derivative, with respect to changes in both the
edge occupancy probabilities $\pi(i) p_{ij}$ and the observation probabilities, of the entropy rate of a FSC.
Indeed, a simple formula, in terms of forward/backward stationary measures, can be cobbled together
by translating the derivative formula in [42] to stationary measures and combining this with Theorem
3.2. To clarify the connection, their result is shown first in terms of conditional density functions for
$\alpha$ and $\beta$. Paraphrasing their result, in terms of the derivative of the edge occupancy probabilities

One can decompose this formula to see that the term $\lambda_{ij}$ gives the change in the edge occupancy
probability, the term $f_{\alpha|Q_t}(\alpha|i) f_{\beta|Q_{t+1}}(\beta|j) f_{Y|Q_t Q_{t+1}}(y|i,j)$ is the probability of $\alpha, \beta, y$ given the transition,
and the logarithmic term gives the contribution to $H(Q_{t+1} = j \mid Q_t = i, Y_{-\infty}^{\infty})$ for this $\alpha, \beta, y$.
Next, we modify this expression to use unconditional $\alpha, \beta$ distributions with

^W)| 9=0 - - E q A. J AaxAo _ • _ £ M y) In ^ amtk{y)m 



4 The objective function is actually concave, but one can negate the objective and minimize instead. 
where (a) holds because $f_{\alpha|Q_t}(\alpha|i)$ is the conditional density of $\alpha$ given the true state is $i$ and $f_{\beta|Q_{t+1}}(\beta|j)$ is the
conditional density of $\beta$ given the true state is $j$, (b) follows from Lemmas 2.21 and 2.25, and (c) follows
from $M_{ij}(y) = p_{ij} h_{ij}(y)$. Finally, using $H(\mathcal{Y}; \theta) = H(\mathcal{X}; \theta) - H(\mathcal{X}|\mathcal{Y}; \theta) + H(\mathcal{Y}|\mathcal{X}; \theta)$ and

±H(X;9)\^ Q = - £ X,\n P ,, 



ijeQ 



d9 

1 ij'eQ yey 



we find that -^H(y; 9)\ e _ Q is given by 
£ A y / A«(da)i/(d0) £ 



It is straightforward to combine this with Theorem 3.2, though the final expression is even more unwieldy.



i,jeQ -^oxA) yGi , 



6 Conclusions 

This paper considers the derivative of the entropy rate for general hidden Markov processes and derives
a closed-form expression for this derivative in the high-noise limit. An application is presented relating to
the achievable information rates of finite-state channels. Again, a closed-form expression is derived for
the high-noise limit. Two examples of interest are considered. First, transmission over a BSC under a
(0,1) RLL constraint is treated and the capacity-achieving input distribution is derived in the high-noise
limit. Second, an intersymbol-interference channel in AWGN is considered and the capacity is derived
in the high-noise limit.
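A Technical Details

A.1 Lemmas for Theorem 3.2

Lemma A.1. Consider the function $F(\alpha, \beta) = -\alpha^T M' \beta \log \left( \alpha^T M \beta \right)$ where $M$ is a non-negative matrix
and $M'$ is a real matrix. This function is Lipschitz continuous w.r.t. $\|\cdot\|_1$ on $(\alpha, \beta) \in \mathcal{P}_\delta \times \mathcal{B}_\delta$ where
$\mathcal{B}_\delta = \{ u \in \Delta_\delta \mid \pi^T u = 1 \}$, $\eta = \min_i \pi(i) > 0$, and $\delta > 0$. This implies that

$$|F(\alpha, \beta) - F(\alpha', \beta)| \le L_\alpha \|\alpha - \alpha'\|_1$$

$$|F(\alpha, \beta) - F(\alpha, \beta')| \le L_\beta \|\beta - \beta'\|_1$$

$$|F(\alpha, \beta) - F(\alpha', \beta')| \le L_\alpha \|\alpha - \alpha'\|_1 + L_\beta \|\beta - \beta'\|_1,$$

where $c = \delta^2 \sum_{i,j} M_{ij}$ and

$$L_\alpha = \|M'\|_1 \frac{1}{\eta} \log \frac{1}{c} + \|M'\|_1 \|M\|_1 \frac{1}{\eta^2 c}$$

$$L_\beta = \|M'\|_\infty \log \frac{1}{c} + \|M'\|_\infty \|M\|_\infty \frac{1}{c}.$$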

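Proof. Let $G : \mathbb{R}^m \to \mathbb{R}$ be any function that is differentiable on a convex set $D \subset \mathbb{R}^m$. Then, the mean
value theorem of vector calculus implies that

$$G(y) - G(x) = G'\left( x + t(y - x) \right)^T (y - x)$$

for some $t \in [0, 1]$. Applying Hölder's inequality allows one to upper bound the Lipschitz constant w.r.t.
$\|\cdot\|_1$ and gives the upper bound

$$G(y) - G(x) \le \sup_{t \in [0,1]} \left\| G'\left( x + t(y - x) \right) \right\|_\infty \|x - y\|_1 \le \sup_{z \in D} \|G'(z)\|_\infty \|x - y\|_1.$$

Since $F(\alpha, \beta)$ is differentiable w.r.t. $\alpha$, we can bound the Lipschitz constant $L_\alpha$ with

$$L_\alpha = \sup_{\alpha \in \mathcal{P}_\delta} \sup_{\beta \in \mathcal{B}_\delta} \sup_{\|u\|_1 \le 1} \left| u^T M' \beta \log \frac{1}{\alpha^T M \beta} - \alpha^T M' \beta \, \frac{u^T M \beta}{\alpha^T M \beta} \right|$$

$$\overset{(a)}{\le} \sup_{\alpha \in \mathcal{P}_\delta} \sup_{\beta \in \mathcal{B}_\delta} \sup_{\|u\|_1 \le 1} \left[ \left| u^T M' \beta \right| \log \frac{1}{c} + \left| \alpha^T M' \beta \right| \left| u^T M \beta \right| \frac{1}{c} \right]$$

$$\overset{(b)}{\le} \|M'\|_1 \|\beta\|_1 \log \frac{1}{c} + \|M'\|_1 \|\beta\|_1 \|M\|_1 \|\beta\|_1 \frac{1}{c}$$

$$\overset{(c)}{\le} \|M'\|_1 \frac{1}{\eta} \log \frac{1}{c} + \|M'\|_1 \|M\|_1 \frac{1}{\eta^2 c}, \tag{A.1}$$

where (a) follows from $\alpha^T M \beta \ge c$ with $c = \delta^2 \sum_{i,j} M_{ij}$, (b) follows from $|x^T M y| \le \|x\|_\infty \|M\|_1 \|y\|_1$,
and (c) follows from $\|\beta\|_1 \le \eta^{-1}$, which holds because $\pi^T \beta = 1$.

Likewise, $F(\alpha, \beta)$ is differentiable w.r.t. $\beta$ and we can bound the Lipschitz constant $L_\beta$ with

$$L_\beta = \sup_{\alpha \in \mathcal{P}_\delta} \sup_{\beta \in \mathcal{B}_\delta} \sup_{\|u\|_1 \le 1} \left| \alpha^T M' u \log \frac{1}{\alpha^T M \beta} - \alpha^T M' \beta \, \frac{\alpha^T M u}{\alpha^T M \beta} \right|$$

$$\overset{(a)}{\le} \sup_{\alpha \in \mathcal{P}_\delta} \sup_{\beta \in \mathcal{B}_\delta} \sup_{\|u\|_1 \le 1} \left[ \left| \alpha^T M' u \right| \log \frac{1}{c} + \left| \alpha^T M' \beta \right| \left| \alpha^T M u \right| \frac{1}{c} \right]$$

$$\overset{(b)}{\le} \|M'\|_\infty \log \frac{1}{c} + \|M'\|_\infty \|M\|_\infty \frac{1}{c}, \tag{A.2}$$

where (a) is the same as above and (b) follows from $|x^T M y| \le \|x\|_1 \|M\|_\infty \|y\|_\infty$. □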
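Lemma A.2. If the HMP is $\epsilon$-primitive for $\epsilon > 0$, then for some $\gamma < 1$ and $C < \infty$ we have

$$E \left[ \sum_{y \in \mathcal{Y}} \alpha^T M'(y) \beta \log \left( \alpha^T M(y) \beta \right) - \alpha_j^T M'(y) \beta_{j+1} \log \left( \alpha_j^T M(y) \beta_{j+1} \right) \right] \le 2 L_\alpha C \gamma^{j-1} + 2 L_\beta C \gamma^{n-j+1},$$

where $c(y) = \delta^2 \sum_{i,j} [M(y)]_{ij}$, and

$$L_\alpha = \sum_{y \in \mathcal{Y}} \left[ \|M'(y)\|_1 \frac{1}{\eta} \log \frac{1}{c(y)} + \|M'(y)\|_1 \|M(y)\|_1 \frac{1}{\eta^2 c(y)} \right]$$

$$L_\beta = \sum_{y \in \mathcal{Y}} \left[ \|M'(y)\|_\infty \log \frac{1}{c(y)} + \|M'(y)\|_\infty \|M(y)\|_\infty \frac{1}{c(y)} \right].$$

The expectation assumes that $\alpha, \beta$ are drawn from their respective stationary distributions while $\alpha_j, \beta_{j+1}$
are drawn from the distributions implied by an arbitrary initialization of $\alpha_1, \beta_{n+1}$.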

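Proof. Since the HMP is $\epsilon$-primitive for $\epsilon > 0$, there is a $\delta$ such that $\min_i \alpha_i > \delta$ and $\min_i \beta_i > \delta$
on the entire support of $\alpha, \beta$. It also follows that $\eta = \min_i \pi(i) > 0$. Now, consider the function
$F_y(\alpha, \beta) = -\alpha^T M'(y) \beta \log \left( \alpha^T M(y) \beta \right)$. Under these conditions, Lemma A.1 shows that this function is
Lipschitz continuous w.r.t. $\|\cdot\|_1$ on the support of $\alpha, \beta$ with Lipschitz constants $L_\alpha(y)$ and $L_\beta(y)$ defined
by generalizing (A.1) and (A.2). Therefore, we can write

$$E_{\alpha,\beta} \left[ \sum_{y \in \mathcal{Y}} \left( F_y(\alpha, \beta) - F_y(\alpha_j, \beta_{j+1}) \right) \right] \le \sum_{y \in \mathcal{Y}} E_{\alpha,\beta} \left[ L_\alpha(y) \|\alpha - \alpha_j\|_1 + L_\beta(y) \|\beta - \beta_{j+1}\|_1 \right]$$

$$\overset{(a)}{\le} \sum_{y \in \mathcal{Y}} \left[ L_\alpha(y) \, 2 d(\alpha, \alpha_j) + L_\beta(y) \, 2 d(\beta, \beta_{j+1}) \right]$$

$$\overset{(b)}{\le} \sum_{y \in \mathcal{Y}} L_\alpha(y) \, 2 C \gamma^{j-1} + \sum_{y \in \mathcal{Y}} L_\beta(y) \, 2 C \gamma^{n-j+1},$$

where (a) follows from Lemma 2.4 and (b) follows from Lemma 2.12 because the HMP is $\epsilon$-primitive. □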
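A.2 Proof of Lemma 3.4

Proof. The first two results follow from Lemmas 2.21 and 2.25. Substituting and integrating gives

$$\int_{\Delta_{\mathcal{Q}}} \mu(d\alpha) \, \alpha(q) = \int_{\Delta_{\mathcal{Q}}} \mu_q(d\alpha) = \Pr(Q = q)$$

and

$$\int_{\Delta_{\mathcal{Q}}} \nu(d\beta) \, \beta(q) = \frac{1}{\pi(q)} \int_{\Delta_{\mathcal{Q}}} \nu_q(d\beta) = 1,$$

where $\mu_q(d\alpha) = \Pr(Q = q, \alpha \in d\alpha)$ and $\nu_q(d\beta) = \Pr(Q = q, \beta \in d\beta)$.

Using the fact that

$$\sum_{y \in \mathcal{Y}} M(y) = P,$$

we can evaluate the third and fourth results with

$$\int_{\Delta_{\mathcal{Q}}} \mu(d\alpha) \, \alpha^T \sum_{y \in \mathcal{Y}} M(y) \beta = \pi^T P \beta = \pi^T \beta = 1$$

and

$$\alpha^T \sum_{y \in \mathcal{Y}} M(y) \int_{\Delta_{\mathcal{Q}}} \nu(d\beta) \, \beta = \alpha^T P \mathbf{1} = \alpha^T \mathbf{1} = 1.$$

Finally, the fifth result follows from

$$\int_{\Delta_{\mathcal{Q}}} \mu(d\alpha) \, \alpha^T \sum_{y \in \mathcal{Y}} M'_\theta(y) \int_{\Delta_{\mathcal{Q}}} \nu(d\beta) \, \beta = \pi^T P'_\theta \mathbf{1} = \pi^T \frac{d}{d\theta} \mathbf{1} = 0.$$

□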



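A.3 Proof of Theorem 3.6

Proof. First, we point out that $\lim_{\theta \to \theta^*} M_\theta(y) = s(y) P$ implies that output symbols provide no state
information at $\theta = \theta^*$ so that $H(\mathcal{Y}; \theta^*) = H(Y_1; \theta^*)$. This also implies that, at $\theta = \theta^*$, the forward and
backward Blackwell measures are Dirac measures, $\mu(A) = \mathbb{1}_A(\pi)$ and $\nu(B) = \mathbb{1}_B(\mathbf{1})$, concentrated on
$\pi, \mathbf{1}$. By Theorem 3.2, the derivative of the entropy rate is uniformly continuous on $D$ and we have

$$\lim_{\theta \to \theta^*} \frac{d}{d\theta} H(\mathcal{Y}; \theta) = -\lim_{\theta \to \theta^*} E_{\alpha,\beta} \left[ \sum_{y \in \mathcal{Y}} \alpha^T M'_\theta(y) \beta \ln \left( \alpha^T M_\theta(y) \beta \right) \right]$$

$$= -\sum_{y \in \mathcal{Y}} \pi^T M'(y) \mathbf{1} \ln \left( s(y) \right) - \pi^T \sum_{y \in \mathcal{Y}} M'(y) \mathbf{1} \ln \left( \pi^T P \mathbf{1} \right)$$

$$\overset{(a)}{=} -\sum_{y \in \mathcal{Y}} \pi^T M'(y) \mathbf{1} \ln \left( s(y) \right),$$

where (a) holds because $\pi^T P \mathbf{1} = 1$.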

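For the 2nd derivative, we apply the derivative shortcut a second time by noting that

$$\frac{d^2}{d\theta^2} g(\theta, \ldots, \theta) = \sum_{i=1}^n \sum_{j=1}^n \frac{\partial}{\partial \theta_i} \frac{\partial}{\partial \theta_j} g(\theta_1, \ldots, \theta_n) \Big|_{(\theta_1, \ldots, \theta_n) = (\theta, \ldots, \theta)}.$$

Applying this to $g_n(\theta_1, \ldots, \theta_n)$ for the entropy rate, with all derivatives evaluated at
$(\theta_1, \ldots, \theta_n) = (\theta^*, \ldots, \theta^*)$, gives

$$g_n''(\theta^*) = -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \frac{\partial}{\partial \theta_i} \frac{\partial}{\partial \theta_j} \sum_{y_1^n \in \mathcal{Y}^n} \pi^T \Big( \prod_{t=1}^n M_{\theta_t}(y_t) \Big) \mathbf{1} \cdot \log \left( \pi^T \Big( \prod_{t=1}^n M_{\theta_t}(y_t) \Big) \mathbf{1} \right)$$

$$\overset{(a)}{=} -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \frac{\partial}{\partial \theta_i} \sum_{y_1^n \in \mathcal{Y}^n} \left[ \frac{\partial}{\partial \theta_j} \pi^T \Big( \prod_{t=1}^n M_{\theta_t}(y_t) \Big) \mathbf{1} \right] \cdot \log \left( \pi^T \Big( \prod_{t=1}^n M_{\theta_t}(y_t) \Big) \mathbf{1} \right) - \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \frac{\partial}{\partial \theta_i} \frac{\partial}{\partial \theta_j} \sum_{y_1^n \in \mathcal{Y}^n} \pi^T \Big( \prod_{t=1}^n M_{\theta_t}(y_t) \Big) \mathbf{1} \tag{A}$$

$$= -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n \in \mathcal{Y}^n} \pi^T M(y_1^{j-1}) M''(y_j) M(y_{j+1}^n) \mathbf{1} \cdot \log \left( \pi^T \Big( \prod_{t=1}^n M(y_t) \Big) \mathbf{1} \right) \tag{T1}$$

$$\quad - \frac{1}{n} \sum_{j=1}^n \sum_{y_1^n \in \mathcal{Y}^n} \frac{\left( \pi^T M(y_1^{j-1}) M'(y_j) M(y_{j+1}^n) \mathbf{1} \right)^2}{\pi^T \left( \prod_{t=1}^n M(y_t) \right) \mathbf{1}} \tag{T2}$$

$$\quad - \frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n \in \mathcal{Y}^n} \pi^T M(y_1^{i-1}) M'(y_i) M(y_{i+1}^{j-1}) M'(y_j) M(y_{j+1}^n) \mathbf{1} \cdot \log \left( \pi^T \Big( \prod_{t=1}^n M(y_t) \Big) \mathbf{1} \right) \tag{T3}$$

$$\quad - \frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n \in \mathcal{Y}^n} \frac{\left( \pi^T M(y_1^{i-1}) M'(y_i) M(y_{i+1}^n) \mathbf{1} \right) \left( \pi^T M(y_1^{j-1}) M'(y_j) M(y_{j+1}^n) \mathbf{1} \right)}{\pi^T \left( \prod_{t=1}^n M(y_t) \right) \mathbf{1}}, \tag{T4}$$

where the term labeled (A) is zero because $\sum_{y_1^n} \pi^T \left( \prod_{t=1}^n M_{\theta_t}(y_t) \right) \mathbf{1}$ is constant in $(\theta_1, \ldots, \theta_n)$. Using the term
labels in the equation (i.e., $T_1, T_2, \ldots$), we see that $g_n''(\theta^*) = T_1 + T_2 + T_3 + T_4$, where the terms $T_1, T_2$ are
associated with $i = j$, and the terms $T_3, T_4$ are associated with $i \ne j$. Using this decomposition, we can
reduce each term separately.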

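For the first term, $M(y) = s(y) P$ implies that

$$T_1 = -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n \in \mathcal{Y}^n} \pi^T M(y_1^{j-1}) M''(y_j) M(y_{j+1}^n) \mathbf{1} \cdot \log \left( \pi^T \Big( \prod_{t=1}^n M(y_t) \Big) \mathbf{1} \right)$$

$$= -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n \in \mathcal{Y}^n} \Big( \prod_{t=1}^n s(y_t) \Big) \frac{\pi^T M''(y_j) \mathbf{1}}{s(y_j)} \sum_{k=1}^n \log \left( s(y_k) \right)$$

$$= -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n \in \mathcal{Y}^n} \Big( \prod_{t \ne j} s(y_t) \Big) \left[ \pi^T M''(y_j) \mathbf{1} \cdot \log \left( s(y_j) \right) + \pi^T M''(y_j) \mathbf{1} \sum_{k=1, k \ne j}^n \log \left( s(y_k) \right) \right]$$

$$\overset{(a)}{=} -\frac{1}{n} \sum_{j=1}^n \sum_{y_j \in \mathcal{Y}} \pi^T M''(y_j) \mathbf{1} \cdot \log \left( s(y_j) \right) = -\sum_{y \in \mathcal{Y}} \pi^T M''(y) \mathbf{1} \cdot \log \left( s(y) \right),$$

where (a) follows from the fact that

$$\sum_{y_1^n \in \mathcal{Y}^n} \Big( \prod_{t \ne j} s(y_t) \Big) \pi^T M''(y_j) \mathbf{1} \sum_{k \ne j} \log \left( s(y_k) \right) = \Bigg( \sum_{y_{\sim j}} \prod_{t \ne j} s(y_t) \sum_{k \ne j} \log \left( s(y_k) \right) \Bigg) \Bigg( \sum_{y_j \in \mathcal{Y}} \pi^T M''(y_j) \mathbf{1} \Bigg) = 0.$$

For the second term, $M(y) = s(y) P$ implies that

$$T_2 = -\frac{1}{n} \sum_{j=1}^n \sum_{y_1^n \in \mathcal{Y}^n} \frac{\left( \pi^T M(y_1^{j-1}) M'(y_j) M(y_{j+1}^n) \mathbf{1} \right)^2}{\pi^T \left( \prod_{t=1}^n M(y_t) \right) \mathbf{1}} = -\frac{1}{n} \sum_{j=1}^n \sum_{y_j \in \mathcal{Y}} \frac{\left( \pi^T M'(y_j) \mathbf{1} \right)^2}{s(y_j)} = -\sum_{y \in \mathcal{Y}} \frac{\left( \pi^T M'(y) \mathbf{1} \right)^2}{s(y)}.$$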

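For the third term, we notice first that $\sum_{y \in \mathcal{Y}} M'(y) = 0$ implies

$$\sum_{y_i, y_j, y_k \in \mathcal{Y}} \pi^T M'(y_i) P^{j-i-1} M'(y_j) \mathbf{1} \cdot \log \left( s(y_k) \right) = 0$$

if either $i \ne k$ or $j \ne k$. This gives

$$T_3 = -\frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n \in \mathcal{Y}^n} \pi^T M(y_1^{i-1}) M'(y_i) M(y_{i+1}^{j-1}) M'(y_j) M(y_{j+1}^n) \mathbf{1} \cdot \log \left( \pi^T \Big( \prod_{t=1}^n M(y_t) \Big) \mathbf{1} \right)$$

$$= -\frac{2}{n} \sum_{k=1}^n \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_i, y_j, y_k \in \mathcal{Y}} \pi^T M'(y_i) P^{j-i-1} M'(y_j) \mathbf{1} \cdot \log \left( s(y_k) \right)$$

$$= 0$$

because $i < j$.

For the fourth term, we have

$$T_4 = -\frac{2}{n} \sum_{j=1}^n \sum_{i=1}^{j-1} \sum_{y_1^n \in \mathcal{Y}^n} \frac{\left( \pi^T M(y_1^{i-1}) M'(y_i) M(y_{i+1}^n) \mathbf{1} \right) \left( \pi^T M(y_1^{j-1}) M'(y_j) M(y_{j+1}^n) \mathbf{1} \right)}{\pi^T \left( \prod_{t=1}^n M(y_t) \right) \mathbf{1}} = 0$$

because $\sum_{y \in \mathcal{Y}} M'(y) = 0$. □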